# **Capstone Project**
## EDA and Feature Engineering

 - *This notebook is dedicated to analysing the dataset and creating tables to use in the recommendations and the models.* 

## Starting work: Presenting the Data Sets

All the data used is contained in three files:

* **portfolio.json** - containing offer ids and metadata about each offer (duration, type, etc.)
* **profile.json** - demographic data for each customer
* **transcript.json** - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

**portfolio.json**
* id (string) - offer id
* offer_type (string) - type of offer, i.e. BOGO, discount, informational
* difficulty (int) - minimum required spend to complete an offer
* reward (int) - reward given for completing an offer
* duration (int) - time for offer to be open, in days
* channels (list of strings)

**profile.json**
* age (int) - age of the customer 
* became_member_on (int) - date when customer created an app account
* gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
* id (str) - customer id
* income (float) - customer's income

**transcript.json**
* event (str) - record description (i.e. transaction, offer received, offer viewed, etc.)
* person (str) - customer id
* time (int) - time in hours since start of test. The data begins at time t=0
* value - (dict of strings) - either an offer id or transaction amount depending on the record

--- 

# Importing Libraries and Loading Data

Import the necessary Python libraries and load the data from the JSON files into pandas DataFrames.

### Libraries Imported
- `pandas` (`pd`): Used for data manipulation and analysis.
- `numpy` (`np`): Provides support for numerical operations and vectorized computation.

### Loading Data
The `pd.read_json()` function is used to read JSON files:
- `portfolio_raw`: Contains data from `portfolio.json`, the promotional offers.
- `profile_raw`: Contains data from `profile.json`, the customer demographic information.
- `transcript_raw`: Contains data from `transcript.json`, the offer events and transactions.

Each file is read with `orient='records'` and `lines=True`, ensuring that each JSON object in the file is interpreted as a separate record (suitable for line-delimited JSON files).
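
As a minimal sketch of what `lines=True` does, the snippet below parses a small in-memory sample instead of the project files. The two records and their ids are hypothetical and only mimic the shape of `portfolio.json`.

```python
import io

import pandas as pd

# Hypothetical two-record sample in line-delimited JSON: one object per line.
sample = io.StringIO(
    '{"id": "a1", "offer_type": "bogo", "channels": ["web", "email"]}\n'
    '{"id": "b2", "offer_type": "discount", "channels": ["email"]}\n'
)

# lines=True makes pandas parse every line as its own record.
sample_df = pd.read_json(sample, orient='records', lines=True)

print(sample_df.shape)
print(sample_df['channels'].iloc[0])
```

Each line becomes one row, and a JSON array such as `channels` is kept as a Python list inside the cell.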


In [1]:
# importing libraries
import pandas as pd
import numpy as np

# read in the json files
portfolio_raw = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile_raw = pd.read_json('data/profile.json', orient='records', lines=True)
transcript_raw = pd.read_json('data/transcript.json', orient='records', lines=True)

# Data understanding

--- 
Objectives: 

* Examination of each individual table and its corresponding columns.
* Exploratory data analysis (EDA) with some statistics.

> Note: at this stage the data is only inspected in its raw state; no processing is applied. 
---


>>  **portfolio.json:**

This dataset contains ten offers and their attributes, covering three offer types (BOGO, discount and informational). It raises no further questions.

 - The data is clean and ready to be used.

In [2]:
portfolio_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   reward      10 non-null     int64 
 1   channels    10 non-null     object
 2   difficulty  10 non-null     int64 
 3   duration    10 non-null     int64 
 4   offer_type  10 non-null     object
 5   id          10 non-null     object
dtypes: int64(3), object(3)
memory usage: 612.0+ bytes


In [3]:
# showing the entire table
portfolio_raw

Unnamed: 0,reward,channels,difficulty,duration,offer_type,id
0,10,"[email, mobile, social]",10,7,bogo,ae264e3637204a6fb9bb56bc8210ddfd
1,10,"[web, email, mobile, social]",10,5,bogo,4d5c57ea9a6940dd891ad53e9dbe8da0
2,0,"[web, email, mobile]",0,4,informational,3f207df678b143eea3cee63160fa8bed
3,5,"[web, email, mobile]",5,7,bogo,9b98b8c7a33c4b65b9aebfe6a799e6d9
4,5,"[web, email]",20,10,discount,0b1e1539f2cc45b7b9fa7c272da2e1d7
5,3,"[web, email, mobile, social]",7,7,discount,2298d6c36e964ae4a3e7e9706d1fb8c2
6,2,"[web, email, mobile, social]",10,10,discount,fafdcd668e3743c1bb461111dcafc2a4
7,0,"[email, mobile, social]",0,3,informational,5a8bc65990b245e5a138643cd4eb9837
8,5,"[web, email, mobile, social]",5,5,bogo,f19421c1d4aa40978ebb69ca19b0e20d
9,2,"[web, email, mobile]",10,7,discount,2906b810c7d4411798c6938adc9daaa5


>> **profile.json**

This dataset contains demographic data for 17,000 customers.

 - Of these, 2,175 (~12.8%) have missing values for both gender (`None`) and income (`NaN`), with the corresponding age values set to 118.
 - Approximately 50% of customers are male, 36% are female, and 1.2% are recorded as 'O' (other).
 - The `became_member_on` column stores dates as integers in `YYYYMMDD` form rather than as datetime values.
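
One way to confirm that the missing gender, missing income and age 118 all fall on the same rows is to compare boolean masks. This is only a sketch on a hypothetical four-row frame (`toy_profile`), not on `profile_raw` itself.

```python
import pandas as pd

# Hypothetical miniature of the profile table.
toy_profile = pd.DataFrame({
    'gender': ['M', None, 'F', None],
    'age': [33, 118, 75, 118],
    'income': [72000.0, None, 100000.0, None],
})

missing_gender = toy_profile['gender'].isna()
missing_income = toy_profile['income'].isna()
placeholder_age = toy_profile['age'] == 118

# True only if the three conditions select exactly the same rows.
same_rows = bool(
    (missing_gender == missing_income).all()
    and (missing_gender == placeholder_age).all()
)

print(same_rows, int(placeholder_age.sum()))
```

The same three masks could be built on `profile_raw` to check the observation on the full data.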

In [4]:
profile_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            14825 non-null  object 
 1   age               17000 non-null  int64  
 2   id                17000 non-null  object 
 3   became_member_on  17000 non-null  int64  
 4   income            14825 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.2+ KB


In [5]:
profile_raw.describe(include='all') 

Unnamed: 0,gender,age,id,became_member_on,income
count,14825,17000.0,17000,17000.0,14825.0
unique,3,,17000,,
top,M,,e4052622e5ba45a8b96b59aba68cf068,,
freq,8484,,1,,
mean,,62.531412,,20167030.0,65404.991568
std,,26.73858,,11677.5,21598.29941
min,,18.0,,20130730.0,30000.0
25%,,45.0,,20160530.0,49000.0
50%,,58.0,,20170800.0,64000.0
75%,,73.0,,20171230.0,80000.0


In [6]:
profile_raw['gender'].value_counts(dropna=False, normalize=True)

gender
M       0.499059
F       0.360529
None    0.127941
O       0.012471
Name: proportion, dtype: float64

In [7]:
profile_raw['age'].value_counts(dropna=False, normalize=True)

age
118    0.127941
58     0.024000
53     0.021882
51     0.021353
59     0.021118
         ...   
100    0.000706
96     0.000471
98     0.000294
101    0.000294
99     0.000294
Name: proportion, Length: 85, dtype: float64

In [8]:
profile_raw[profile_raw['gender'] == 'O']

Unnamed: 0,gender,age,id,became_member_on,income
31,O,53,d1ede868e29245ea91818a903fec04c6,20170916,52000.0
273,O,60,d0be9ff460964c3398a33ad9b2829f3a,20180216,94000.0
383,O,49,0d0a9ca9281248a8a35806c9ae68f872,20171207,42000.0
513,O,63,01f46a5191424005af436cdf48a5da7c,20150920,89000.0
576,O,73,644ac06dc9b34a5bbd237a465cf47571,20180316,88000.0
...,...,...,...,...,...
16670,O,76,e8926849bbe24ce488d4f3fcd3b537e8,20180320,52000.0
16683,O,49,1f68e9b6850f49348235a281a47d9f15,20170607,56000.0
16731,O,51,a97208c5be42445d9949e82e0f70f622,20160707,55000.0
16741,O,56,994b6ef7a8ca46e3b379518399f6ec93,20180221,52000.0


>> **transcript.json**

 - The dictionary keys in the `value` column are inconsistent: offer ids appear under both `offer id` and `offer_id`.
 - The word 'offer' in the `event` column could be removed to simplify the category names.
 - Two main datasets can be separated from this one: the offer events and the purchase transactions.
 - There are no missing values.
 
 - Note that the `value` column contains two types of dictionaries: dictionaries with offer ids, and dictionaries with the amount spent in a transaction event. 
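
To see every key used in the dictionaries, and not only the first key of each one, the keys can be counted with `collections.Counter`. The sketch below uses a hypothetical list of dictionaries (`sample_values`) with made-up ids in place of `transcript_raw['value']`.

```python
from collections import Counter

# Hypothetical sample mimicking the three shapes found in the 'value' column.
sample_values = [
    {'offer id': 'x1'},
    {'offer_id': 'x1', 'reward': 5},
    {'amount': 9.53},
    {'offer id': 'x2'},
]

# Count every key of every dictionary, so secondary keys are not missed.
key_counts = Counter(key for value in sample_values for key in value)

print(key_counts)
```

Counting all keys also reveals secondary keys, such as a `reward` stored next to `offer_id`, which a first-key-only inspection does not show.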

In [9]:
transcript_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   person  306534 non-null  object
 1   event   306534 non-null  object
 2   value   306534 non-null  object
 3   time    306534 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 9.4+ MB


In [10]:
# showing the set of first keys found in the dictionaries of the 'value' column
{[*x][0] for x in transcript_raw['value']}

{'amount', 'offer id', 'offer_id'}

In [11]:
transcript_raw['event'].value_counts()

event
transaction        138953
offer received      76277
offer viewed        57725
offer completed     33579
Name: count, dtype: int64

In [12]:
transcript_raw['time'].describe()

count    306534.000000
mean        366.382940
std         200.326314
min           0.000000
25%         186.000000
50%         408.000000
75%         528.000000
max         714.000000
Name: time, dtype: float64

In [13]:
transcript_raw

Unnamed: 0,person,event,value,time
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'},0
1,a03223e636434f42ac4c3df47e8bac43,offer received,{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'},0
2,e2127556f4f64592b11af22de27a7932,offer received,{'offer id': '2906b810c7d4411798c6938adc9daaa5'},0
3,8ec6ce2a7e7949b1bf142def7d0e0586,offer received,{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'},0
4,68617ca6246f4fbc85e91a2a49552598,offer received,{'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'},0
...,...,...,...,...
306529,b3a1272bc9904337b331bf348c3e8c17,transaction,{'amount': 1.5899999999999999},714
306530,68213b08d99a4ae1b0dcb72aebd9aa35,transaction,{'amount': 9.53},714
306531,a00058cf10334a308c68e7631c529907,transaction,{'amount': 3.61},714
306532,76ddbd6576844afe811f1a3c0fbb5bec,transaction,{'amount': 3.5300000000000002},714


# Data Preparation
---
- Objective:

Creation of `Analytical Tables` datasets for analysis.

- Strategy:
1. Load raw data from the original tables.
2. Clean, normalise and transform some variables.
3. Restructure it using groupby/unstack and create features.
4. Save the structured data in a data-store folder (`medalion_data_store`).

This process ensures that the data tables and **features** are properly formatted, aggregated, cleaned and persisted for further analysis. 

---

#### **Portfolio Dataset**
- Create a new column with a short `id` to make the offers easier to read, and drop the old `id` column.
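
The short ids could also be generated instead of typed by hand. The sketch below is one possible approach, assuming the offers should be lettered in table order; `offer_ids` holds only the first three ids of the portfolio table for brevity.

```python
from string import ascii_uppercase

# First three offer ids from the portfolio table, in table order.
offer_ids = [
    'ae264e3637204a6fb9bb56bc8210ddfd',
    '4d5c57ea9a6940dd891ad53e9dbe8da0',
    '3f207df678b143eea3cee63160fa8bed',
]

# Pair each id with the next capital letter: ofr_A, ofr_B, ofr_C, ...
short_ids = {
    offer_id: f'ofr_{letter}'
    for offer_id, letter in zip(offer_ids, ascii_uppercase)
}

print(short_ids)
```

A hand-written dictionary has the advantage of staying stable if the row order of the source table ever changes.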

In [14]:
#creating a copy from the original dataframe
portfolio = portfolio_raw.copy() 

# dictionary mapping each original offer id to a short, readable id
port_id = {
    'ae264e3637204a6fb9bb56bc8210ddfd': 'ofr_A',
    '4d5c57ea9a6940dd891ad53e9dbe8da0': 'ofr_B',
    '3f207df678b143eea3cee63160fa8bed': 'ofr_C',
    '9b98b8c7a33c4b65b9aebfe6a799e6d9': 'ofr_D',
    '0b1e1539f2cc45b7b9fa7c272da2e1d7': 'ofr_E',
    '2298d6c36e964ae4a3e7e9706d1fb8c2': 'ofr_F',
    'fafdcd668e3743c1bb461111dcafc2a4': 'ofr_G',
    '5a8bc65990b245e5a138643cd4eb9837': 'ofr_H',
    'f19421c1d4aa40978ebb69ca19b0e20d': 'ofr_I',
    '2906b810c7d4411798c6938adc9daaa5': 'ofr_J'
}

# mapping the id column
portfolio['ofr_id_short'] = portfolio['id'].map(port_id)

portfolio = portfolio.drop(columns=['id'])

# persist a csv file to the bronze folder
portfolio.to_csv('medalion_data_store/bronze/portfolio.csv', index=False)

portfolio

Unnamed: 0,reward,channels,difficulty,duration,offer_type,ofr_id_short
0,10,"[email, mobile, social]",10,7,bogo,ofr_A
1,10,"[web, email, mobile, social]",10,5,bogo,ofr_B
2,0,"[web, email, mobile]",0,4,informational,ofr_C
3,5,"[web, email, mobile]",5,7,bogo,ofr_D
4,5,"[web, email]",20,10,discount,ofr_E
5,3,"[web, email, mobile, social]",7,7,discount,ofr_F
6,2,"[web, email, mobile, social]",10,10,discount,ofr_G
7,0,"[email, mobile, social]",0,3,informational,ofr_H
8,5,"[web, email, mobile, social]",5,5,bogo,ofr_I
9,2,"[web, email, mobile]",10,7,discount,ofr_J


#### **Profile Dataset**
- Convert the `became_member_on` column to a standardized **datetime** format for consistency and easier analysis.
- Create a new column with only the year and month of the date.
- Fill `None` values in the gender column with *gen_ukn* (gender unknown).
- Create an `age_group` column using the `pd.cut()` function.
- Save the data in the data store.
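
The sketch below shows how `pd.cut()` assigns ages to right-closed bins, using the same bin edges and labels as this notebook on a hypothetical series of ages. It also shows one optional variant, not used in the notebook, that masks the placeholder age 118 before binning so those customers are not labelled `Senior`.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([18, 25, 26, 45, 65, 66, 118])
bins = [-1, 25, 45, 65, 118]
labels = ['Young', 'Adult', 'Middle', 'Senior']

# Bins are right-closed, so 25 is 'Young' and 26 is 'Adult'.
age_group = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
print(age_group.tolist())

# Optional variant: treat 118 as missing before binning.
ages_masked = ages.where(ages != 118)
age_group_masked = pd.cut(ages_masked, bins=bins, labels=labels, include_lowest=True)
print(age_group_masked.isna().tolist())
```

Whether to mask the placeholder is a modelling decision; the notebook keeps 118 inside the `Senior` group.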

In [15]:
# copy the raw data into a new dataframe
profile = profile_raw.copy(deep=True)

# Convert the 'became_member_on' column to a datetime format
profile['became_member_on'] = pd.to_datetime(profile['became_member_on'], format='%Y%m%d')

# Create a new column with only the year and month of the membership
# (the column is already datetime, so it can be formatted directly)
profile['bec_memb_year_month'] = profile['became_member_on'].dt.strftime('%Y-%m')

# Fill missing gender values with an explicit 'gender unknown' category
profile['gender'] = profile['gender'].fillna('gen_ukn')

# applying age categorization
profile['age_group'] = (pd.cut(
    profile['age'],
    bins=[-1, 25, 45, 65, 118],
    labels=['Young', 'Adult', 'Middle', 'Senior'], 
    include_lowest=True  
    ))

# Persist a csv file to the data store
profile.to_csv('medalion_data_store/bronze/profile.csv', index=False)

profile.head()

Unnamed: 0,gender,age,id,became_member_on,income,bec_memb_year_month,age_group
0,gen_ukn,118,68be06ca386d4c31939f3a4f0e3dd783,2017-02-12,,2017-02,Senior
1,F,55,0610b486422d4921ae7d2bf64640c50b,2017-07-15,112000.0,2017-07,Middle
2,gen_ukn,118,38fe809add3b4fcf9315a9694bb96ff5,2018-07-12,,2018-07,Senior
3,F,75,78afa995795e4d85b5d9ceeca43f5fef,2017-05-09,100000.0,2017-05,Senior
4,gen_ukn,118,a03223e636434f42ac4c3df47e8bac43,2017-08-04,,2017-08,Senior


#### **Transcript dataset**
- Clean, normalize and transform column values from the original transcript table.
- Create a new table `transcript_b` using the `json_normalize()` and `concat()` functions to extract the dictionaries into new columns.

**Strategy:**

1. Copy the data.
2. Normalize the dictionary keys in the `value` column: `offer id` --> `offer_id`.
3. Create the `transcript_b` table using the `pd.json_normalize()` and `pd.concat()` functions.
4. Normalize the `event` column values by removing the word `offer`. 
5. Create an `ofr_id_short` column with a more readable id and drop the old `offer_id` column. Fill missing values with 'tran'.
6. Create a `tag` column that numbers, for each person, every repeated person-event-offer interaction and every transaction in sequence (a person can interact with the same offer more than once and make several transactions).
7. Persist the table to a csv file in the data store.
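
The steps above can be sketched on a hypothetical four-row transcript. The frame `raw`, the helper `normalise_keys` and the ids are illustrative only; the sketch keeps `offer_id` rather than mapping it to short ids.

```python
import pandas as pd

# Hypothetical miniature transcript with both spellings of the offer key.
raw = pd.DataFrame({
    'person': ['p1', 'p1', 'p1', 'p2'],
    'event': ['offer received', 'offer received', 'transaction', 'offer received'],
    'value': [{'offer id': 'a'}, {'offer_id': 'a'}, {'amount': 3.5}, {'offer id': 'b'}],
    'time': [0, 6, 12, 0],
})


def normalise_keys(value):
    """Return a new dict in which 'offer id' is renamed to 'offer_id'."""
    return {('offer_id' if key == 'offer id' else key): val for key, val in value.items()}


# Expand the dictionaries into columns and attach them to the other columns.
values = pd.json_normalize(raw['value'].map(normalise_keys).tolist())
flat = pd.concat([raw.drop(columns='value'), values], axis=1)

# Simplify the event names and mark transactions in the offer column.
flat['event'] = flat['event'].str.replace('offer ', '', regex=False)
flat['offer_id'] = flat['offer_id'].fillna('tran')

# Number repeated person-offer-event combinations: 0, 1, 2, ...
flat['tag'] = flat.groupby(['person', 'offer_id', 'event']).cumcount()

print(flat)
```

Here customer `p1` receives offer `a` twice, so the two rows get tags 0 and 1, while every other combination starts at 0.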

In [16]:
def fix_offer_id(value):
    """
    Fixes the 'offer id' key in a dictionary by renaming it to 'offer_id'.

    Parameters:
    value (dict): A dictionary that may contain the 'offer id' key.

    Returns:
    dict: The updated dictionary with 'offer id' replaced by 'offer_id'.
    """
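    if isinstance(value, dict) and 'offer id' in value:
        # Build a new dict instead of mutating in place: DataFrame.copy(deep=True)
        # does not copy Python objects recursively, so the dictionaries are
        # still shared with the raw dataframe.
        return {('offer_id' if key == 'offer id' else key): val for key, val in value.items()}
    return value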

In [17]:
# copy the raw data into a new dataframe
transcript = transcript_raw.copy(deep=True)


# applying the fix_offer_id function
transcript['value'] = transcript['value'].apply(fix_offer_id)

# Normalize the 'value' column with json_normalize method
value_df = pd.json_normalize(transcript['value']) 
transcript_b = pd.concat([transcript, value_df], axis=1).drop('value', axis=1)

# Normalizing the event column category names (dropping the leading word 'offer')
transcript_b['event'] = [x.split(' ')[1] if len(x.split(' ')) > 1 else x for x in transcript_b['event']]

# mapping the offer_id to the offer_id_short and Dropping the offer_id column
transcript_b['ofr_id_short'] = transcript_b['offer_id'].map(port_id).fillna('tran')
transcript_b = transcript_b.drop(columns = ['offer_id'])

# creating a tag column to identify the order of the events for each person-offer-event fact
transcript_b['tag'] = (
    transcript_b.groupby(['person', 'ofr_id_short', 'event'], observed=True)
    .cumcount()
)

# persist a csv file to the bronze folder
transcript_b.to_csv('medalion_data_store/bronze/transcript_b.csv', index=False)

In [18]:
transcript_b.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   person        306534 non-null  object 
 1   event         306534 non-null  object 
 2   time          306534 non-null  int64  
 3   amount        138953 non-null  float64
 4   reward        33579 non-null   float64
 5   ofr_id_short  306534 non-null  object 
 6   tag           306534 non-null  int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 16.4+ MB


In [19]:
transcript_b.describe(include='all')

Unnamed: 0,person,event,time,amount,reward,ofr_id_short,tag
count,306534,306534,306534.0,138953.0,33579.0,306534,306534.0
unique,17000,4,,,,11,
top,94de646f7b6041228ca7dec82adb97d2,transaction,,,,tran,
freq,51,138953,,,,138953,
mean,,,366.38294,12.777356,4.904137,,2.445487
std,,,200.326314,30.250529,2.886647,,3.973985
min,,,0.0,0.05,2.0,,0.0
25%,,,186.0,2.78,2.0,,0.0
50%,,,408.0,8.89,5.0,,0.0
75%,,,528.0,18.07,5.0,,4.0


In [20]:
transcript_b

Unnamed: 0,person,event,time,amount,reward,ofr_id_short,tag
0,78afa995795e4d85b5d9ceeca43f5fef,received,0,,,ofr_D,0
1,a03223e636434f42ac4c3df47e8bac43,received,0,,,ofr_E,0
2,e2127556f4f64592b11af22de27a7932,received,0,,,ofr_J,0
3,8ec6ce2a7e7949b1bf142def7d0e0586,received,0,,,ofr_G,0
4,68617ca6246f4fbc85e91a2a49552598,received,0,,,ofr_B,0
...,...,...,...,...,...,...,...
306529,b3a1272bc9904337b331bf348c3e8c17,transaction,714,1.59,,tran,13
306530,68213b08d99a4ae1b0dcb72aebd9aa35,transaction,714,9.53,,tran,1
306531,a00058cf10334a308c68e7631c529907,transaction,714,3.61,,tran,19
306532,76ddbd6576844afe811f1a3c0fbb5bec,transaction,714,3.53,,tran,12


### Separating *transaction* events in the transcript_b dataset from the other events

* **Source**: 

    `transcript_b`.

* **Created tables:**

    `events` and `transactions`: separated data from `transcript_b`


- Create separate `events` and `transactions` dataframes.
- Create `time_diff` from the `time` column using the `.diff()` function for each person in the `events` and `transactions` tables.
- Drop unnecessary columns.
- Save the data in the data store.

> Note: `time_diff` is computed with `.diff()` on the `time` column within each person's rows (and, for `events`, within each event type). The first row of each group has no previous record, so its `time_diff` is NaN. These NaN values are kept on purpose: imputing a number such as 0 would distort later statistics more than leaving the values missing.
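
A minimal sketch of this behaviour on a hypothetical five-row table (`tx`) with two customers. It also compares the mean gap with and without filling the first-row NaN values with 0.

```python
import pandas as pd

# Hypothetical transactions for two customers, times in hours.
tx = pd.DataFrame({
    'person': ['a', 'a', 'a', 'b', 'b'],
    'time': [228, 414, 528, 0, 30],
})

# Sort first so that diff() compares consecutive records of the same person.
tx = tx.sort_values(['person', 'time'])
tx['time_diff'] = tx.groupby('person')['time'].diff()

# mean() skips NaN; filling with 0 pulls the average gap down.
mean_gap = tx['time_diff'].mean()
mean_gap_filled = tx['time_diff'].fillna(0).mean()

print(tx)
print(mean_gap, mean_gap_filled)
```

With the NaN values kept, the mean gap is 110 hours; filling them with 0 lowers it to 66 hours, which illustrates the distortion the note refers to.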

In [21]:
# splitting the data into offer events and transactions
events = transcript_b[transcript_b['event'] != 'transaction'].copy()
transactions = transcript_b[transcript_b['event'] == 'transaction'].copy()

# drop the 'amount' column as it is NaN for all offer events.
events = events.drop(columns=['amount']) 

# drop the reward (all NaN) and ofr_id_short (always 'tran') columns as they are not relevant in this dataset. 
transactions = transactions.drop(columns=['reward', 'ofr_id_short']) 

# Sorting table by time
events = events.sort_values(by=['person', 'time'])
transactions = transactions.sort_values(by=['person', 'time'])

events['time_diff'] = events.groupby(['person', 'event'])['time'].diff() #.fillna(0)
transactions['time_diff'] = transactions.groupby(['person'])['time'].diff() #.fillna(0)

# drop unnecessary column (one unique value)
transactions = transactions.drop(columns=['event'])

#creating week and day time columns
transactions['week'] = np.ceil(transactions['time'] / (24*7))
transactions['day'] = np.ceil(transactions['time'] / 24)

# persisting the data in the bronze layer
events.to_csv('medalion_data_store/bronze/events.csv', index=False)
transactions.to_csv('medalion_data_store/bronze/transactions.csv', index=False)
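
The `week` and `day` columns in the cell above are derived with `np.ceil`, which maps a transaction at `time` 0 to day 0 and week 0 while hours 1 to 24 map to day 1. The sketch below compares this with a floor-based, 1-based alternative on hypothetical hour values; the alternative is an assumption for illustration and is not what the notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical hour offsets since the start of the test.
hours = pd.Series([0, 6, 24, 30, 168, 174])

# Ceiling-based day, as in the cell above: hour 0 lands on day 0.
day_ceil = np.ceil(hours / 24)

# Floor-based alternative: hours 0-23 are day 1, hours 24-47 are day 2, ...
day_floor = hours // 24 + 1

print(pd.DataFrame({'hours': hours, 'day_ceil': day_ceil, 'day_floor': day_floor}))
```

The two conventions differ at hour 0 and at exact multiples of 24, so downstream aggregations by day or week should stick to one of them.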

In [22]:
transactions

Unnamed: 0,person,time,amount,tag,time_diff,week,day
89291,0009655768c64bdeb2e877511632db8f,228,22.16,0,,2.0,10.0
168412,0009655768c64bdeb2e877511632db8f,414,8.57,1,186.0,3.0,18.0
228422,0009655768c64bdeb2e877511632db8f,528,14.11,2,114.0,4.0,22.0
237784,0009655768c64bdeb2e877511632db8f,552,13.56,3,24.0,4.0,23.0
258883,0009655768c64bdeb2e877511632db8f,576,10.27,4,24.0,4.0,24.0
...,...,...,...,...,...,...,...
200255,ffff82501cea40309d5fdd7edcca4a07,498,13.17,10,84.0,3.0,21.0
214716,ffff82501cea40309d5fdd7edcca4a07,504,7.79,11,6.0,3.0,21.0
258361,ffff82501cea40309d5fdd7edcca4a07,576,14.23,12,72.0,4.0,24.0
274809,ffff82501cea40309d5fdd7edcca4a07,606,10.12,13,30.0,4.0,26.0


In [23]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
Index: 138953 entries, 89291 to 289924
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   person     138953 non-null  object 
 1   time       138953 non-null  int64  
 2   amount     138953 non-null  float64
 3   tag        138953 non-null  int64  
 4   time_diff  122375 non-null  float64
 5   week       138953 non-null  float64
 6   day        138953 non-null  float64
dtypes: float64(4), int64(2), object(1)
memory usage: 8.5+ MB


In [24]:
events.info()

<class 'pandas.core.frame.DataFrame'>
Index: 167581 entries, 55972 to 262475
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   person        167581 non-null  object 
 1   event         167581 non-null  object 
 2   time          167581 non-null  int64  
 3   reward        33579 non-null   float64
 4   ofr_id_short  167581 non-null  object 
 5   tag           167581 non-null  int64  
 6   time_diff     120979 non-null  float64
dtypes: float64(2), int64(2), object(3)
memory usage: 10.2+ MB


 - Following a selected customer

In [25]:
events[(events['person'] == '0009655768c64bdeb2e877511632db8f') & (events['event'] == 'received')]

Unnamed: 0,person,event,time,reward,ofr_id_short,tag,time_diff
55972,0009655768c64bdeb2e877511632db8f,received,168,,ofr_H,0,
113605,0009655768c64bdeb2e877511632db8f,received,336,,ofr_C,0,168.0
153401,0009655768c64bdeb2e877511632db8f,received,408,,ofr_I,0,72.0
204340,0009655768c64bdeb2e877511632db8f,received,504,,ofr_G,0,96.0
247879,0009655768c64bdeb2e877511632db8f,received,576,,ofr_J,0,72.0


In [26]:
events[(events['person'] == '0009655768c64bdeb2e877511632db8f') & (events['event'] == 'viewed')]

Unnamed: 0,person,event,time,reward,ofr_id_short,tag,time_diff
77705,0009655768c64bdeb2e877511632db8f,viewed,192,,ofr_H,0,
139992,0009655768c64bdeb2e877511632db8f,viewed,372,,ofr_C,0,180.0
187554,0009655768c64bdeb2e877511632db8f,viewed,456,,ofr_I,0,84.0
233413,0009655768c64bdeb2e877511632db8f,viewed,540,,ofr_G,0,84.0


In [27]:
events[(events['person'] == '0009655768c64bdeb2e877511632db8f') & (events['event'] == 'completed')]

Unnamed: 0,person,event,time,reward,ofr_id_short,tag,time_diff
168413,0009655768c64bdeb2e877511632db8f,completed,414,5.0,ofr_I,0,
228423,0009655768c64bdeb2e877511632db8f,completed,528,2.0,ofr_G,0,114.0
258884,0009655768c64bdeb2e877511632db8f,completed,576,2.0,ofr_J,0,48.0


In [28]:
transactions[(transactions['person'] == '0009655768c64bdeb2e877511632db8f')]

Unnamed: 0,person,time,amount,tag,time_diff,week,day
89291,0009655768c64bdeb2e877511632db8f,228,22.16,0,,2.0,10.0
168412,0009655768c64bdeb2e877511632db8f,414,8.57,1,186.0,3.0,18.0
228422,0009655768c64bdeb2e877511632db8f,528,14.11,2,114.0,4.0,22.0
237784,0009655768c64bdeb2e877511632db8f,552,13.56,3,24.0,4.0,23.0
258883,0009655768c64bdeb2e877511632db8f,576,10.27,4,24.0,4.0,24.0
293497,0009655768c64bdeb2e877511632db8f,660,12.36,5,84.0,4.0,28.0
300930,0009655768c64bdeb2e877511632db8f,690,28.16,6,30.0,5.0,29.0
302205,0009655768c64bdeb2e877511632db8f,696,18.41,7,6.0,5.0,29.0


### Tracking customers

Explore the tables that track customers to see 'where they are':
  - There are 16,578 customers in the transactions table and 16,994 in the events table.
  - Of the 17,000 customers in the profile table, about 97.5% made transactions and about 99.96% have offer events.
  - Six customers in the profile table (less than 0.1%) did not receive offers (they are not in the events table) but made transactions.
  
  Potential customers:
  - 422 customers (about 2.5%) made no transactions (they are not in the transactions table) but received offers.
  - Among these 422 customers, the most frequent values are gender `M` (189, about 44.8%), age group `Senior` (188, about 44.5%) and membership month `2018-03` (42, about 10%). These are per-column modes rather than a single joint segment, and 89 of the 422 have no income recorded.
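
The comparisons behind these counts are plain set differences between the customer ids of the two tables. A minimal sketch with hypothetical ids (`c1` to `c5`):

```python
# Hypothetical customer ids standing in for the profile, events and transactions tables.
profile_ids = {'c1', 'c2', 'c3', 'c4', 'c5'}
event_people = {'c1', 'c2', 'c3', 'c4'}
transaction_people = {'c1', 'c2', 'c5'}

# Customers who bought something but never received an offer.
only_transactions = transaction_people - event_people

# Customers who received offers but never bought anything.
only_events = event_people - transaction_people

# Share of all customers with at least one transaction, in percent.
share_with_transactions = 100 * len(transaction_people) / len(profile_ids)

print(sorted(only_transactions), sorted(only_events), share_with_transactions)
```

Applied to the real tables, the same differences give the 6 and 422 customers reported above.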


In [29]:
len({x for x in transactions['person']})

16578

In [30]:
len({x for x in events['person']})

16994

In [31]:
len((set(transactions['person']) - set(events['person'])))

6

In [32]:
# customers that did not receive offers, but made transactions.
profile.loc[profile['id'].isin(list((set(transactions['person']) - set(events['person'])))), :]

Unnamed: 0,gender,age,id,became_member_on,income,bec_memb_year_month,age_group
872,F,72,c6e579c6821c41d1a7a6a9cf936e91bb,2017-10-14,35000.0,2017-10,Senior
5425,gen_ukn,118,da7a7c0dcfcb41a8acc7864a53cf60fb,2017-08-01,,2017-08,Senior
5639,F,66,eb540099db834cf59001f83a4561aef3,2017-09-29,34000.0,2017-09,Senior
6789,F,55,3a4874d8f0ef42b9a1b72294902afea9,2016-08-16,88000.0,2016-08,Middle
14763,F,54,ae8111e7e8cd4b60a8d35c42c1110555,2017-01-06,72000.0,2017-01,Middle
15391,M,91,12ede229379747bd8d74ccdc20097ca3,2015-10-05,70000.0,2015-10,Senior


In [33]:
# customers that never made transactions but received offers.
len((set(events['person']) - set(transactions['person'])))

422

In [34]:
profile.loc[profile['id'].isin(
    list(

        (set(events['person']) - set(transactions['person']))
  
        )), :].describe(include='all')

Unnamed: 0,gender,age,id,became_member_on,income,bec_memb_year_month,age_group
count,422,422.0,422,422,333.0,422,422
unique,4,,422,,,33,4
top,M,,b6f74fc8e1664cfb9b44834dd9f7cf48,,,2018-03,Senior
freq,189,,1,,,42,188
mean,,70.810427,,2017-12-11 19:44:04.549763072,73537.537538,,
min,,18.0,,2013-11-23 00:00:00,31000.0,,
25%,,51.0,,2017-10-06 06:00:00,58000.0,,
50%,,62.5,,2018-01-21 00:00:00,73000.0,,
75%,,82.75,,2018-04-23 18:00:00,89000.0,,
max,,118.0,,2018-07-26 00:00:00,119000.0,,


In [35]:
profile.loc[profile['id'].isin(list((set(events['person']) - set(transactions['person'])))), :].isna().sum()

gender                  0
age                     0
id                      0
became_member_on        0
income                 89
bec_memb_year_month     0
age_group               0
dtype: int64

In [36]:
profile.isna().sum()

gender                    0
age                       0
id                        0
became_member_on          0
income                 2175
bec_memb_year_month       0
age_group                 0
dtype: int64

In [37]:
profile

Unnamed: 0,gender,age,id,became_member_on,income,bec_memb_year_month,age_group
0,gen_ukn,118,68be06ca386d4c31939f3a4f0e3dd783,2017-02-12,,2017-02,Senior
1,F,55,0610b486422d4921ae7d2bf64640c50b,2017-07-15,112000.0,2017-07,Middle
2,gen_ukn,118,38fe809add3b4fcf9315a9694bb96ff5,2018-07-12,,2018-07,Senior
3,F,75,78afa995795e4d85b5d9ceeca43f5fef,2017-05-09,100000.0,2017-05,Senior
4,gen_ukn,118,a03223e636434f42ac4c3df47e8bac43,2017-08-04,,2017-08,Senior
...,...,...,...,...,...,...,...
16995,F,45,6d5f3a774f3d4714ab0c092238f3a1d7,2018-06-04,54000.0,2018-06,Adult
16996,M,61,2cb4f97358b841b9a9773a7aa05a9d77,2018-07-13,72000.0,2018-07,Middle
16997,M,49,01d26f638c274aa0b965d24cefe3183f,2017-01-26,73000.0,2017-01,Middle
16998,F,83,9dc1421481194dcd9400aec7c9ae6366,2016-03-07,50000.0,2016-03,Senior


In [38]:
portfolio

Unnamed: 0,reward,channels,difficulty,duration,offer_type,ofr_id_short
0,10,"[email, mobile, social]",10,7,bogo,ofr_A
1,10,"[web, email, mobile, social]",10,5,bogo,ofr_B
2,0,"[web, email, mobile]",0,4,informational,ofr_C
3,5,"[web, email, mobile]",5,7,bogo,ofr_D
4,5,"[web, email]",20,10,discount,ofr_E
5,3,"[web, email, mobile, social]",7,7,discount,ofr_F
6,2,"[web, email, mobile, social]",10,10,discount,ofr_G
7,0,"[email, mobile, social]",0,3,informational,ofr_H
8,5,"[web, email, mobile, social]",5,5,bogo,ofr_I
9,2,"[web, email, mobile]",10,7,discount,ofr_J


In [39]:
events

Unnamed: 0,person,event,time,reward,ofr_id_short,tag,time_diff
55972,0009655768c64bdeb2e877511632db8f,received,168,,ofr_H,0,
77705,0009655768c64bdeb2e877511632db8f,viewed,192,,ofr_H,0,
113605,0009655768c64bdeb2e877511632db8f,received,336,,ofr_C,0,168.0
139992,0009655768c64bdeb2e877511632db8f,viewed,372,,ofr_C,0,180.0
153401,0009655768c64bdeb2e877511632db8f,received,408,,ofr_I,0,72.0
...,...,...,...,...,...,...,...
214717,ffff82501cea40309d5fdd7edcca4a07,completed,504,5.0,ofr_D,0,90.0
230690,ffff82501cea40309d5fdd7edcca4a07,viewed,534,,ofr_D,0,120.0
246495,ffff82501cea40309d5fdd7edcca4a07,received,576,,ofr_J,2,72.0
258362,ffff82501cea40309d5fdd7edcca4a07,completed,576,2.0,ofr_J,2,72.0


In [40]:
transactions

Unnamed: 0,person,time,amount,tag,time_diff,week,day
89291,0009655768c64bdeb2e877511632db8f,228,22.16,0,,2.0,10.0
168412,0009655768c64bdeb2e877511632db8f,414,8.57,1,186.0,3.0,18.0
228422,0009655768c64bdeb2e877511632db8f,528,14.11,2,114.0,4.0,22.0
237784,0009655768c64bdeb2e877511632db8f,552,13.56,3,24.0,4.0,23.0
258883,0009655768c64bdeb2e877511632db8f,576,10.27,4,24.0,4.0,24.0
...,...,...,...,...,...,...,...
200255,ffff82501cea40309d5fdd7edcca4a07,498,13.17,10,84.0,3.0,21.0
214716,ffff82501cea40309d5fdd7edcca4a07,504,7.79,11,6.0,3.0,21.0
258361,ffff82501cea40309d5fdd7edcca4a07,576,14.23,12,72.0,4.0,24.0
274809,ffff82501cea40309d5fdd7edcca4a07,606,10.12,13,30.0,4.0,26.0


# **Features Engineering**

- Process the datasets to create useful tables and extract features from the data.

## Sorted offers table:

Organize the offer sequence by person and event, creating a summary of each person's offer events in the order in which they occur.
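
The idea can be sketched with a `groupby` and a string join on a hypothetical four-row events table (`ev`). This is an equivalent formulation for illustration and may differ in detail from the cell that follows.

```python
import pandas as pd

# Hypothetical offer events for two customers.
ev = pd.DataFrame({
    'person': ['p1', 'p1', 'p1', 'p2'],
    'event': ['received', 'received', 'completed', 'received'],
    'time': [0, 168, 200, 0],
    'ofr_id_short': ['ofr_A', 'ofr_B', 'ofr_B', 'ofr_C'],
})

# One ' > '-joined sequence per person and event, in time order.
seq = (
    ev.sort_values(['person', 'time'])
    .groupby(['person', 'event'])['ofr_id_short']
    .agg(lambda offers: ' > '.join(offers))
    .unstack('event')
)

# First completed offer; stays NaN for customers who completed none.
first_completed = seq['completed'].str.split(' > ').str[0]

print(seq)
print(first_completed)
```

Customer `p2` never completed an offer, so the `completed` sequence is missing and can later be filled with a label such as `no_ofr_comp`.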

In [41]:
sorted_offers = events.pivot_table(index=['person'], columns=['event'], values=['ofr_id_short'], aggfunc= lambda x: ' > '.join(x)).reset_index()

sorted_offers.columns = ['_'.join(col).strip('_') for col in sorted_offers.columns.to_flat_index()]

sorted_offers['first_completed'] = [x.split(' > ')[0] if type(x) == str else x for x in sorted_offers['ofr_id_short_completed']]
sorted_offers['last_completed'] = [x.split(' > ')[-1] if type(x) == str else x for x in sorted_offers['ofr_id_short_completed']]


sorted_offers['ofr_id_short_completed'] = sorted_offers['ofr_id_short_completed'].fillna('no_ofr_comp')
sorted_offers['ofr_id_short_received'] = sorted_offers['ofr_id_short_received'].fillna('no_ofr_rec')
sorted_offers['ofr_id_short_viewed'] = sorted_offers['ofr_id_short_viewed'].fillna('no_ofr_view')

sorted_offers = sorted_offers.fillna('no_ofr_comp')

# pd.get_dummies(sorted_offers, columns=['ofr_id_short_completed'], drop_first=True, prefix_sep= ' > ')

sorted_offers.to_csv('medalion_data_store/silver/sorted_offers.csv', index=False)

sorted_offers

Unnamed: 0,person,ofr_id_short_completed,ofr_id_short_received,ofr_id_short_viewed,first_completed,last_completed
0,0009655768c64bdeb2e877511632db8f,ofr_I > ofr_G > ofr_J,ofr_H > ofr_C > ofr_I > ofr_G > ofr_J,ofr_H > ofr_C > ofr_I > ofr_G,ofr_I,ofr_J
1,00116118485d4dfda04fdbaba9a87b5c,no_ofr_comp,ofr_I > ofr_I,ofr_I > ofr_I,no_ofr_comp,no_ofr_comp
2,0011e0d4e6b944f998e987f904e8c1e5,ofr_F > ofr_E > ofr_D,ofr_C > ofr_F > ofr_H > ofr_E > ofr_D,ofr_C > ofr_F > ofr_H > ofr_E > ofr_D,ofr_F,ofr_D
3,0020c2b971eb4e9188eac86d93036a77,ofr_G > ofr_G > ofr_B,ofr_G > ofr_A > ofr_G > ofr_B > ofr_H,ofr_G > ofr_B > ofr_H,ofr_G,ofr_B
4,0020ccbbb6d84e358d3414a3ff76cffd,ofr_F > ofr_I > ofr_D,ofr_F > ofr_I > ofr_H > ofr_D,ofr_F > ofr_I > ofr_H > ofr_D,ofr_F,ofr_D
...,...,...,...,...,...,...
16989,fff3ba4757bd42088c044ca26d73817a,ofr_G > ofr_D > ofr_J,ofr_G > ofr_D > ofr_H > ofr_J > ofr_H > ofr_J,ofr_G > ofr_D > ofr_H,ofr_G,ofr_J
16990,fff7576017104bcc8677a8d63322b5e1,ofr_G > ofr_G > ofr_D,ofr_G > ofr_B > ofr_A > ofr_G > ofr_D,ofr_G > ofr_B > ofr_A > ofr_G,ofr_G,ofr_D
16991,fff8957ea8b240a6b5e634b6ee8eafcf,no_ofr_comp,ofr_G > ofr_C > ofr_B,ofr_G > ofr_B,no_ofr_comp,no_ofr_comp
16992,fffad4f4828548d1b5583907f2e9906b,ofr_I > ofr_I > ofr_D,ofr_I > ofr_H > ofr_I > ofr_D,ofr_I > ofr_H > ofr_I > ofr_D,ofr_I,ofr_D


In [42]:
sorted_offers[sorted_offers['person'] == '0009655768c64bdeb2e877511632db8f']

Unnamed: 0,person,ofr_id_short_completed,ofr_id_short_received,ofr_id_short_viewed,first_completed,last_completed
0,0009655768c64bdeb2e877511632db8f,ofr_I > ofr_G > ofr_J,ofr_H > ofr_C > ofr_I > ofr_G > ofr_J,ofr_H > ofr_C > ofr_I > ofr_G,ofr_I,ofr_J


In [43]:
sorted_offers.isna().sum()

person                    0
ofr_id_short_completed    0
ofr_id_short_received     0
ofr_id_short_viewed       0
first_completed           0
last_completed            0
dtype: int64
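
The sequence columns above store each customer's offers as strings joined by `' > '`. As a minimal sketch (the toy frame and the `n_completed` column are illustrative assumptions, not part of the notebook), the first and last completed offers can be recovered from such a string like this:

```python
import pandas as pd

# Toy data: column names mirror sorted_offers, values are made up.
toy = pd.DataFrame({
    "person": ["p1", "p2"],
    "ofr_id_short_completed": ["ofr_I > ofr_G > ofr_J", "no_ofr_comp"],
})

# Split each sequence on the ' > ' separator used in the table.
steps = toy["ofr_id_short_completed"].str.split(" > ")

# First and last element of each list; the placeholder maps to itself.
toy["first_completed"] = steps.str[0]
toy["last_completed"] = steps.str[-1]

# Hypothetical extra feature: number of completed offers (0 for the placeholder).
toy["n_completed"] = [0 if s == ["no_ofr_comp"] else len(s) for s in steps]
```

Keeping the placeholder string instead of `NaN` means the split still yields a one-element list, so the first and last columns never come out empty.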

## Grouping profiles with events table:

 - merging tables and grouping by segments
 - creating features in the process

 - merging

In [44]:
# merging events and profile (profiles with missing values are dropped first)
df = events.merge(profile.dropna(), how='left', left_on=['person'], right_on=['id']).drop(columns=['id'])
df = df.merge(portfolio, on=['ofr_id_short'], how='left')
df

Unnamed: 0,person,event,time,reward_x,ofr_id_short,tag,time_diff,gender,age,became_member_on,income,bec_memb_year_month,age_group,reward_y,channels,difficulty,duration,offer_type
0,0009655768c64bdeb2e877511632db8f,received,168,,ofr_H,0,,M,33.0,2017-04-21,72000.0,2017-04,Adult,0,"[email, mobile, social]",0,3,informational
1,0009655768c64bdeb2e877511632db8f,viewed,192,,ofr_H,0,,M,33.0,2017-04-21,72000.0,2017-04,Adult,0,"[email, mobile, social]",0,3,informational
2,0009655768c64bdeb2e877511632db8f,received,336,,ofr_C,0,168.0,M,33.0,2017-04-21,72000.0,2017-04,Adult,0,"[web, email, mobile]",0,4,informational
3,0009655768c64bdeb2e877511632db8f,viewed,372,,ofr_C,0,180.0,M,33.0,2017-04-21,72000.0,2017-04,Adult,0,"[web, email, mobile]",0,4,informational
4,0009655768c64bdeb2e877511632db8f,received,408,,ofr_I,0,72.0,M,33.0,2017-04-21,72000.0,2017-04,Adult,5,"[web, email, mobile, social]",5,5,bogo
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
167576,ffff82501cea40309d5fdd7edcca4a07,completed,504,5.0,ofr_D,0,90.0,F,45.0,2016-11-25,62000.0,2016-11,Adult,5,"[web, email, mobile]",5,7,bogo
167577,ffff82501cea40309d5fdd7edcca4a07,viewed,534,,ofr_D,0,120.0,F,45.0,2016-11-25,62000.0,2016-11,Adult,5,"[web, email, mobile]",5,7,bogo
167578,ffff82501cea40309d5fdd7edcca4a07,received,576,,ofr_J,2,72.0,F,45.0,2016-11-25,62000.0,2016-11,Adult,2,"[web, email, mobile]",10,7,discount
167579,ffff82501cea40309d5fdd7edcca4a07,completed,576,2.0,ofr_J,2,72.0,F,45.0,2016-11-25,62000.0,2016-11,Adult,2,"[web, email, mobile]",10,7,discount


## Offer-Event-Time table:

 - Creating an **offer-event** table by grouping the previous `df` on `['ofr_id_short', 'offer_type', 'difficulty', 'event']`, counting rows per `ofr_id_short`, and adding ratio and `time` features

In [45]:
# Grouping the DataFrame by multiple columns and calculating the count and mean for specific columns
offer_event_time = df.groupby(['ofr_id_short', 'offer_type', 'difficulty', 'event']).agg(
    cnt=('ofr_id_short', 'count'),
    time_mean=('time', 'mean')
).unstack(level=[3]).reset_index()  # Reset index after unstacking

# Flattening the multi-level column index and removing extra underscores
offer_event_time.columns = ['_'.join(col).strip('_') for col in offer_event_time.columns.to_flat_index()]

# Calculating completion ratio (completed offers / received offers)
offer_event_time['comp_ratio'] = offer_event_time['cnt_completed'] / offer_event_time['cnt_received']

# Calculating view ratio (viewed offers / received offers)
offer_event_time['view_ratio'] = offer_event_time['cnt_viewed'] / offer_event_time['cnt_received']

# Calculating the completion-to-view intensity (completion ratio / view ratio)
offer_event_time['comp-vie-intensity'] = offer_event_time['comp_ratio'] / offer_event_time['view_ratio']

# Filling NaN values in the event count columns with 0 (selected by name rather than by position)
count_cols = ['cnt_completed', 'cnt_received', 'cnt_viewed']
offer_event_time[count_cols] = offer_event_time[count_cols].fillna(0)

# Rounding numerical columns to 3 decimal places
offer_event_time = offer_event_time.round(3)

# Saving the result to a CSV file
offer_event_time.to_csv('medalion_data_store/silver/offer_event_time.csv', index=False)

# Display the final DataFrame
offer_event_time



Unnamed: 0,ofr_id_short,offer_type,difficulty,cnt_completed,cnt_received,cnt_viewed,time_mean_completed,time_mean_received,time_mean_viewed,comp_ratio,view_ratio,comp-vie-intensity
0,ofr_A,bogo,10,3688.0,7658.0,6716.0,394.767,329.76,352.622,0.482,0.877,0.549
1,ofr_B,bogo,10,3331.0,7593.0,7298.0,385.722,335.153,353.119,0.439,0.961,0.456
2,ofr_C,informational,0,0.0,7617.0,4144.0,,331.885,358.639,,0.544,
3,ofr_D,bogo,5,4354.0,7677.0,4171.0,407.051,334.146,361.977,0.567,0.543,1.044
4,ofr_E,discount,20,3420.0,7668.0,2663.0,431.549,331.336,366.748,0.446,0.347,1.284
5,ofr_F,discount,7,5156.0,7646.0,7337.0,400.318,336.377,354.75,0.674,0.96,0.703
6,ofr_G,discount,10,5317.0,7597.0,7327.0,399.117,330.487,348.868,0.7,0.964,0.726
7,ofr_H,informational,0,0.0,7618.0,6687.0,,332.475,353.934,,0.878,
8,ofr_I,bogo,5,4296.0,7571.0,7264.0,382.936,332.171,349.797,0.567,0.959,0.591
9,ofr_J,discount,10,4017.0,7632.0,4118.0,409.952,332.003,356.204,0.526,0.54,0.975


In [46]:
offer_event_time.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ofr_id_short         10 non-null     object 
 1   offer_type           10 non-null     object 
 2   difficulty           10 non-null     int64  
 3   cnt_completed        10 non-null     float64
 4   cnt_received         10 non-null     float64
 5   cnt_viewed           10 non-null     float64
 6   time_mean_completed  8 non-null      float64
 7   time_mean_received   10 non-null     float64
 8   time_mean_viewed     10 non-null     float64
 9   comp_ratio           8 non-null      float64
 10  view_ratio           10 non-null     float64
 11  comp-vie-intensity   8 non-null      float64
dtypes: float64(9), int64(1), object(2)
memory usage: 1.1+ KB
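
The group, unstack and flatten pattern used for this table can be hard to follow on the full data. The sketch below reproduces it on a made-up three-column frame (all values are illustrative assumptions) so the resulting column names and the `NaN` ratio for an offer without completions are easy to inspect:

```python
import pandas as pd

# Toy events: ofr_C is never completed, like the informational offers.
toy = pd.DataFrame({
    "ofr_id_short": ["ofr_A", "ofr_A", "ofr_A", "ofr_C", "ofr_C"],
    "event": ["received", "viewed", "completed", "received", "viewed"],
    "time": [0, 6, 30, 0, 12],
})

# One row per offer, one column per (statistic, event) pair.
wide = (
    toy.groupby(["ofr_id_short", "event"])
    .agg(cnt=("event", "count"), time_mean=("time", "mean"))
    .unstack(level="event")
    .reset_index()
)

# Flatten the two-level column index: ('cnt', 'viewed') -> 'cnt_viewed'.
wide.columns = ["_".join(col).strip("_") for col in wide.columns.to_flat_index()]

# Ratio is computed before filling, so a missing completion stays NaN.
wide["comp_ratio"] = wide["cnt_completed"] / wide["cnt_received"]

# Missing counts become 0; columns are selected by name, not by position.
count_cols = ["cnt_completed", "cnt_received", "cnt_viewed"]
wide[count_cols] = wide[count_cols].fillna(0)
```

Selecting the count columns by name avoids depending on the column order that `unstack` happens to produce.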


## Person-Event-Time Table:
 
 -  Grouping `df` on `'person'`, `'gender'`, `'age_group'`, `'event'`, `'ofr_id_short'`, `'tag'`, counting rows per `'ofr_id_short'`, and creating time features

In [47]:
# Grouping the DataFrame by multiple columns, counting events and keeping the event time of each group
person_event_time = df.groupby(['person', 'gender', 'age_group', 'event', 'ofr_id_short', 'tag'], observed=True).agg(
cnt=('ofr_id_short', 'count'),
t = ('time', lambda x: x),

).unstack(level=[3]).reset_index() #.fillna(0)

# Flattening the multi-level column index and removing extra underscores
person_event_time.columns = ['_'.join(col).strip('_') for col in person_event_time.columns.to_flat_index()]

# fill NaN counts with the appropriate value (0 = the event did not happen)
person_event_time.iloc[:, 5:8] = person_event_time.iloc[:, 5:8].fillna(0)

# computing the stage of the offer: 1 = received only, 2 = received and viewed (or received and completed), 3 = all three events
person_event_time['stage'] = person_event_time[['cnt_completed', 'cnt_received', 'cnt_viewed']].sum(axis=1)

# Replacing NaN values with infinity to be consistent with reality (inf = the event never happened)
person_event_time['t_completed'] = person_event_time['t_completed'].fillna(np.inf)
person_event_time['t_received'] = person_event_time['t_received'].fillna(np.inf)
person_event_time['t_viewed'] = person_event_time['t_viewed'].fillna(np.inf)

# Computing time differences for stages of the offer lifecycle
person_event_time['to_vr'] = person_event_time['t_viewed'] - person_event_time['t_received']
person_event_time['to_cv'] = person_event_time['t_completed'] - person_event_time['t_viewed']
person_event_time['to_cr'] = person_event_time['t_completed'] - person_event_time['t_received']

# person_event_time['to_cv'] = person_event_time['to_cv'].fillna(0)

# Calculating curiosity, eagerness, and overall responsiveness scores
person_event_time['curiosity_vr'] = (2 / (np.exp(person_event_time['to_vr']*0.08) + 1))
person_event_time['eagerness_cv'] = [2 / (np.exp(x*0.08) + 1) if not np.isnan(x) else 0 for x in person_event_time['to_cv']]
person_event_time['overall_cr'] = (2 / (np.exp(person_event_time['to_cr']*0.08) + 1))

# Defining influence metrics based on time conditions
person_event_time['influence'] = (
    (~pd.isna(person_event_time['t_completed'])) & 
    (person_event_time['t_completed'] != np.inf) &
    (person_event_time['t_completed'] > person_event_time['t_viewed'])
).astype(int)

# Extreme influence is when the offer is viewed and completed at the same time.
person_event_time['ext_influence'] = (
    (~pd.isna(person_event_time['t_viewed'])) & 
    (person_event_time['t_viewed'] != np.inf) & 
    (person_event_time['t_completed'] == person_event_time['t_viewed'])
).astype(int)


person_event_time = person_event_time.drop(columns=['tag'])

# rounding
person_event_time = person_event_time.round(3)

# saving
person_event_time.to_csv('medalion_data_store/silver/person_event_time.csv', index=False)

In [48]:
person_event_time

Unnamed: 0,person,gender,age_group,ofr_id_short,cnt_completed,cnt_received,cnt_viewed,t_completed,t_received,t_viewed,stage,to_vr,to_cv,to_cr,curiosity_vr,eagerness_cv,overall_cr,influence,ext_influence
0,0009655768c64bdeb2e877511632db8f,M,Adult,ofr_C,0.0,1.0,1.0,inf,336.0,372.0,2.0,36.0,inf,inf,0.106,0.000,0.000,0,0
1,0009655768c64bdeb2e877511632db8f,M,Adult,ofr_G,1.0,1.0,1.0,528.0,504.0,540.0,3.0,36.0,-12.0,24.0,0.106,1.446,0.256,0,0
2,0009655768c64bdeb2e877511632db8f,M,Adult,ofr_H,0.0,1.0,1.0,inf,168.0,192.0,2.0,24.0,inf,inf,0.256,0.000,0.000,0,0
3,0009655768c64bdeb2e877511632db8f,M,Adult,ofr_I,1.0,1.0,1.0,414.0,408.0,456.0,3.0,48.0,-42.0,6.0,0.042,1.933,0.765,0,0
4,0009655768c64bdeb2e877511632db8f,M,Adult,ofr_J,1.0,1.0,0.0,576.0,576.0,inf,2.0,inf,-inf,0.0,0.000,2.000,1.000,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66496,ffff82501cea40309d5fdd7edcca4a07,F,Adult,ofr_E,1.0,1.0,1.0,198.0,168.0,174.0,3.0,6.0,24.0,30.0,0.765,0.256,0.166,1,0
66497,ffff82501cea40309d5fdd7edcca4a07,F,Adult,ofr_G,1.0,1.0,1.0,60.0,0.0,6.0,3.0,6.0,54.0,60.0,0.765,0.026,0.016,1,0
66498,ffff82501cea40309d5fdd7edcca4a07,F,Adult,ofr_J,1.0,1.0,1.0,384.0,336.0,354.0,3.0,18.0,30.0,48.0,0.383,0.166,0.042,1,0
66499,ffff82501cea40309d5fdd7edcca4a07,F,Adult,ofr_J,1.0,1.0,1.0,414.0,408.0,414.0,3.0,6.0,0.0,6.0,0.765,1.000,0.765,0,1
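
The `curiosity_vr`, `eagerness_cv` and `overall_cr` columns all apply the same transform, `2 / (exp(0.08 * delta) + 1)`, to a time difference in hours. The sketch below wraps it in a helper (the name `decay_score` is an assumption made for illustration) and evaluates it on a few deltas that appear in the table above:

```python
import numpy as np

def decay_score(delta_hours, rate=0.08):
    """Return 2 / (exp(rate * delta) + 1) element-wise.

    The score is 1 at delta = 0, tends to 0 as delta grows,
    and exceeds 1 when delta is negative.
    """
    delta = np.asarray(delta_hours, dtype=float)
    with np.errstate(over="ignore"):  # very large deltas overflow to inf, giving a score of 0
        return 2.0 / (np.exp(rate * delta) + 1.0)

# Deltas taken from the table: 0, 6 and 36 hours, never (inf), and -12 hours.
deltas = np.array([0.0, 6.0, 36.0, np.inf, -12.0])
scores = np.round(decay_score(deltas), 3)
```

A negative delta, such as a completion recorded before the view, produces a score above 1. This is why some `eagerness_cv` values in the table are larger than 1.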


In [49]:
person_event_time.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66501 entries, 0 to 66500
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   person         66501 non-null  object  
 1   gender         66501 non-null  object  
 2   age_group      66501 non-null  category
 3   ofr_id_short   66501 non-null  object  
 4   cnt_completed  66501 non-null  float64 
 5   cnt_received   66501 non-null  float64 
 6   cnt_viewed     66501 non-null  float64 
 7   t_completed    66501 non-null  float64 
 8   t_received     66501 non-null  float64 
 9   t_viewed       66501 non-null  float64 
 10  stage          66501 non-null  float64 
 11  to_vr          66501 non-null  float64 
 12  to_cv          55383 non-null  float64 
 13  to_cr          66501 non-null  float64 
 14  curiosity_vr   66501 non-null  float64 
 15  eagerness_cv   66501 non-null  float64 
 16  overall_cr     66501 non-null  float64 
 17  influence      66501 non-null  

In [50]:
# Grouping the DataFrame by multiple columns and counting events per event type and offer
person_event_r_time = df.groupby(['person', 'gender', 'age_group', 'event', 'ofr_id_short'], observed=True).agg(
cnt=('ofr_id_short', 'count'),

).unstack(level=[3,4]).reset_index() #.fillna(0)

# Flattening the multi-level column index and removing extra underscores
person_event_r_time.columns = ['_'.join(col).strip('_') for col in person_event_r_time.columns.to_flat_index()]

person_event_r_time.iloc[:,3:] = person_event_r_time.iloc[:,3:].round(3).fillna(0)


# saving
person_event_r_time.to_csv('medalion_data_store/silver/person_event_r_time.csv', index=False)

In [51]:
person_event_r_time.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14820 entries, 0 to 14819
Data columns (total 31 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   person               14820 non-null  object  
 1   gender               14820 non-null  object  
 2   age_group            14820 non-null  category
 3   cnt_completed_ofr_G  14820 non-null  float64 
 4   cnt_completed_ofr_I  14820 non-null  float64 
 5   cnt_completed_ofr_J  14820 non-null  float64 
 6   cnt_received_ofr_C   14820 non-null  float64 
 7   cnt_received_ofr_G   14820 non-null  float64 
 8   cnt_received_ofr_H   14820 non-null  float64 
 9   cnt_received_ofr_I   14820 non-null  float64 
 10  cnt_received_ofr_J   14820 non-null  float64 
 11  cnt_viewed_ofr_C     14820 non-null  float64 
 12  cnt_viewed_ofr_G     14820 non-null  float64 
 13  cnt_viewed_ofr_H     14820 non-null  float64 
 14  cnt_viewed_ofr_I     14820 non-null  float64 
 15  cnt_completed_ofr_D

## Gender-Event-Time Table

 - grouping df on `['gender', 'age_group', 'ofr_id_short', 'event']` and creating features.

In [52]:
gender_event_time = df.groupby(['gender', 'age_group', 'ofr_id_short', 'event'], observed=True).agg( 
cnt=('event', 'count'),
t = ('time', 'mean'),

).unstack(level=[3]).reset_index() 

# Flattening the multi-level column index and removing extra underscores
gender_event_time.columns = ['_'.join(col).strip('_') for col in gender_event_time.columns.to_flat_index()]

# fill NaN counts with the appropriate value (0 = the event did not happen)
gender_event_time.iloc[:, 3:6] = gender_event_time.iloc[:, 3:6].fillna(0)


# total number of events in the segment (the per-person table uses the same sum as its 1-3 stage)
gender_event_time['stage'] = gender_event_time[['cnt_completed', 'cnt_received', 'cnt_viewed']].sum(axis=1)

# Replacing NaN values with infinity to be consistent with reality (inf = the event never happened)
gender_event_time['t_completed'] = gender_event_time['t_completed'].fillna(np.inf)
gender_event_time['t_received'] = gender_event_time['t_received'].fillna(np.inf)
gender_event_time['t_viewed'] = gender_event_time['t_viewed'].fillna(np.inf)

# Computing time differences for stages of the offer lifecycle
gender_event_time['to_vr'] = gender_event_time['t_viewed'] - gender_event_time['t_received']
gender_event_time['to_cv'] = gender_event_time['t_completed'] - gender_event_time['t_viewed']
gender_event_time['to_cr'] = gender_event_time['t_completed'] - gender_event_time['t_received']

# gender_event_time['to_cv'] = gender_event_time['to_cv'].fillna(0)

# Calculating curiosity, eagerness, and overall responsiveness scores
gender_event_time['curiosity_vr'] = (2 / (np.exp(gender_event_time['to_vr']*0.08) + 1))
gender_event_time['eagerness_cv'] = [2 / (np.exp(x*0.08) + 1) if not np.isnan(x) else 0 for x in gender_event_time['to_cv']]
gender_event_time['overall_cr'] = (2 / (np.exp(gender_event_time['to_cr']*0.08) + 1))

# Defining influence metrics based on time conditions
gender_event_time['influence'] = (
    (~pd.isna(gender_event_time['t_completed'])) & 
    (gender_event_time['t_completed'] != np.inf) &
    (gender_event_time['t_completed'] > gender_event_time['t_viewed'])
).astype(int)

# Extreme influence is when the offer is viewed and completed at the same time.
gender_event_time['ext_influence'] = (
    (~pd.isna(gender_event_time['t_viewed'])) & 
    (gender_event_time['t_viewed'] != np.inf) & 
    (gender_event_time['t_completed'] == gender_event_time['t_viewed'])
).astype(int)

gender_event_time = gender_event_time.drop(columns=['t_completed', 't_received', 't_viewed', 'stage', 'to_vr', 'to_cv', 'to_cr'])

# rounding
gender_event_time = gender_event_time.round(3)

# saving
gender_event_time.to_csv('medalion_data_store/silver/gender_event_time.csv', index=False)

In [53]:
gender_event_time

Unnamed: 0,gender,age_group,ofr_id_short,cnt_completed,cnt_received,cnt_viewed,curiosity_vr,eagerness_cv,overall_cr,influence,ext_influence
0,F,Young,ofr_A,67.0,121.0,115.0,0.424,0.046,0.013,1,0
1,F,Young,ofr_B,62.0,135.0,130.0,0.300,0.372,0.077,1,0
2,F,Young,ofr_C,0.0,126.0,41.0,0.053,0.000,0.000,0,0
3,F,Young,ofr_D,78.0,121.0,53.0,0.150,0.249,0.023,1,0
4,F,Young,ofr_E,64.0,126.0,23.0,1.843,0.000,0.001,1,0
...,...,...,...,...,...,...,...,...,...,...,...
115,O,Senior,ofr_F,18.0,25.0,24.0,0.905,0.044,0.037,1,0
116,O,Senior,ofr_G,15.0,23.0,23.0,0.596,0.127,0.056,1,0
117,O,Senior,ofr_H,0.0,26.0,24.0,1.077,0.000,0.000,0,0
118,O,Senior,ofr_I,18.0,22.0,20.0,0.084,1.578,0.281,0,0
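
The `influence` and `ext_influence` flags used in the person-level and gender-level tables depend only on how the view and completion times compare. The sketch below restates that logic on four made-up rows. Using `np.isfinite` in place of the separate NaN and `inf` checks is an assumption of this sketch, and it additionally excludes `-inf`:

```python
import numpy as np
import pandas as pd

# Toy times in hours; inf marks an event that never happened.
toy = pd.DataFrame({
    "t_viewed":    [174.0, 414.0, 540.0, np.inf],
    "t_completed": [198.0, 414.0, 528.0, 576.0],
})

completed = np.isfinite(toy["t_completed"])
viewed = np.isfinite(toy["t_viewed"])

# influence: the offer was completed strictly after it was viewed.
toy["influence"] = (completed & (toy["t_completed"] > toy["t_viewed"])).astype(int)

# ext_influence: the offer was viewed and completed at the same time.
toy["ext_influence"] = (viewed & (toy["t_completed"] == toy["t_viewed"])).astype(int)
```

Rows where the completion comes before the view, or where the offer was never viewed, get 0 in both flags.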


## User-Transactions Table:

 - A table that tracks transactions per **person**. The aggregation computes several statistics on the transaction **amount** and **time** for each user.

In [54]:
# Group transaction data by person, aggregating amount and time statistics
user_transactions = transactions.groupby(['person']).agg(
    # counting block
    cnt_tran=('amount', 'count'),        
    
    # amount statistics block
    sum_am_tran=('amount', 'sum'),    
    mean_am_tran=('amount', 'mean'),   
    median_am_tran=('amount', 'median'), 
    min_am_tran=('amount', 'min'),         
    max_am_tran=('amount', 'max'),         
    small_tran_count = ('amount', lambda x: (x < x.quantile(0.25)).sum()), # number of small transactions
    big_tran_count = ('amount', lambda x: (x > x.quantile(0.75)).sum()), # number of large transactions
    range_amount_tran=('amount', lambda x: x.max() - x.min()),

    # time statistics block
    mean_t_tran=('time', 'mean'),
    median_t_tran=('time', 'median'),
    min_t_tran=('time', 'min'),        
    max_t_tran=('time', 'max'),
    range_t_tran = ('time', lambda x: x.max() - x.min()),  
    freq_tran = ('time', lambda x: (len(x) / (x.max() - x.min()))*100 if x.max() != x.min() else 1),
    recency_tran = ('time', lambda x: (714 - x.max()/714))  # note: evaluates as 714 - (x.max() / 714)

).round(2).reset_index() #.fillna(0)

# Calculate the ratio of max transaction amount to the sum of transaction amounts
user_transactions['max_to_sum_am_tran'] = (
    user_transactions['max_am_tran'] / user_transactions['sum_am_tran']
)

# Calculate the ratio of median transaction amount to the mean transaction amount
user_transactions['median_to_mean_am_tran'] = (
    user_transactions['median_am_tran'] / user_transactions['mean_am_tran']
)

# Calculate the weekly transaction mean for each person
user_transactions['weekly_tran_mean'] = (
    transactions.groupby(['person', 'week']).size()  # Count transactions per person-week
    .groupby('person').mean().values  # Calculate mean transactions per person
)

# Calculate the weekly transaction minimum for each person
user_transactions['weekly_tran_min'] = (
    transactions.groupby(['person', 'week']).size()  # Count transactions per person-week
    .groupby('person').min().values  # Calculate minimum transactions per person
)

# Calculate the weekly transaction maximum for each person
user_transactions['weekly_tran_max'] = (
    transactions.groupby(['person', 'week']).size()  # Count transactions per person-week
    .groupby('person').max().values  # Calculate maximum transactions per person
)

# save the table in the data store
user_transactions.to_csv('medalion_data_store/silver/user_transactions.csv', index=False)

# Display the final user-transactions table
user_transactions

Unnamed: 0,person,cnt_tran,sum_am_tran,mean_am_tran,median_am_tran,min_am_tran,max_am_tran,small_tran_count,big_tran_count,range_amount_tran,...,min_t_tran,max_t_tran,range_t_tran,freq_tran,recency_tran,max_to_sum_am_tran,median_to_mean_am_tran,weekly_tran_mean,weekly_tran_min,weekly_tran_max
0,0009655768c64bdeb2e877511632db8f,8,127.60,15.95,13.84,8.57,28.16,2,2,19.59,...,228,696,468,1.71,713.03,0.220690,0.867712,2.000000,1,4
1,00116118485d4dfda04fdbaba9a87b5c,3,4.09,1.36,0.70,0.20,3.19,1,1,2.99,...,294,474,180,1.67,713.34,0.779951,0.514706,1.500000,1,2
2,0011e0d4e6b944f998e987f904e8c1e5,5,79.46,15.89,13.49,8.96,23.03,1,1,14.07,...,132,654,522,0.96,713.08,0.289831,0.848962,1.666667,1,3
3,0020c2b971eb4e9188eac86d93036a77,8,196.86,24.61,24.35,17.24,33.86,2,2,16.62,...,54,708,654,1.22,713.01,0.172000,0.989435,2.666667,2,4
4,0020ccbbb6d84e358d3414a3ff76cffd,12,154.05,12.84,12.76,6.81,20.08,3,3,13.27,...,42,672,630,1.90,713.06,0.130347,0.993769,3.000000,1,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16573,fff3ba4757bd42088c044ca26d73817a,11,580.98,52.82,20.98,10.99,388.22,3,3,377.23,...,6,552,546,2.01,713.23,0.668216,0.397198,3.666667,3,5
16574,fff7576017104bcc8677a8d63322b5e1,6,29.94,4.99,5.03,2.08,8.01,2,2,5.93,...,36,696,660,0.91,713.03,0.267535,1.008016,1.500000,1,2
16575,fff8957ea8b240a6b5e634b6ee8eafcf,5,12.15,2.43,0.89,0.64,6.39,1,1,5.75,...,18,576,558,0.90,713.19,0.525926,0.366255,1.666667,1,3
16576,fffad4f4828548d1b5583907f2e9906b,12,88.83,7.40,7.52,2.05,12.18,3,3,10.13,...,36,678,642,1.87,713.05,0.137116,1.016216,2.400000,1,4


In [67]:
user_transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16578 entries, 0 to 16577
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   person                  16578 non-null  object 
 1   cnt_tran                16578 non-null  int64  
 2   sum_am_tran             16578 non-null  float64
 3   mean_am_tran            16578 non-null  float64
 4   median_am_tran          16578 non-null  float64
 5   min_am_tran             16578 non-null  float64
 6   max_am_tran             16578 non-null  float64
 7   small_tran_count        16578 non-null  int64  
 8   big_tran_count          16578 non-null  int64  
 9   range_amount_tran       16578 non-null  float64
 10  mean_t_tran             16578 non-null  float64
 11  median_t_tran           16578 non-null  float64
 12  min_t_tran              16578 non-null  int64  
 13  max_t_tran              16578 non-null  int64  
 14  range_t_tran            16578 non-null

In [55]:
user_transactions.isna().sum()

person                    0
cnt_tran                  0
sum_am_tran               0
mean_am_tran              0
median_am_tran            0
min_am_tran               0
max_am_tran               0
small_tran_count          0
big_tran_count            0
range_amount_tran         0
mean_t_tran               0
median_t_tran             0
min_t_tran                0
max_t_tran                0
range_t_tran              0
freq_tran                 0
recency_tran              0
max_to_sum_am_tran        0
median_to_mean_am_tran    0
weekly_tran_mean          0
weekly_tran_min           0
weekly_tran_max           0
dtype: int64
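
The weekly features above are attached with `.values`, which relies on both group-bys returning persons in the same order. As a hedged alternative sketch on made-up data (the frame `transactions_toy` and its values are assumptions), the same three statistics can be computed in one pass and joined on the `person` key:

```python
import pandas as pd

# Toy transactions: person "a" buys twice in week 0 and once in week 2.
transactions_toy = pd.DataFrame({
    "person": ["a", "a", "a", "b", "b"],
    "week":   [0, 0, 2, 1, 3],
    "amount": [5.0, 7.5, 3.0, 10.0, 2.0],
})

# Number of transactions per person and week.
weekly_counts = transactions_toy.groupby(["person", "week"]).size()

# Mean, min and max of the weekly counts for each person.
weekly_stats = (
    weekly_counts.groupby("person")
    .agg(weekly_tran_mean="mean", weekly_tran_min="min", weekly_tran_max="max")
    .reset_index()
)

# Join on the key instead of relying on matching row order.
per_user = (
    transactions_toy.groupby("person")
    .agg(cnt_tran=("amount", "count"))
    .reset_index()
    .merge(weekly_stats, on="person", how="left")
)
```

Weeks without any transaction are not counted here, as in the original cell, so the mean is taken over active weeks only.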

## Profile-Transactions table:

 - A table that aggregates transactions by profile segment (**gender** and **age_group**) after merging them with the profile data. The aggregation computes several statistics on the transaction **amount** and **time** for each segment.

In [56]:
# Merging profile data with transaction data using 'id' and 'person' as join keys
profile_transactions = profile.merge(transactions, left_on=['id'], right_on=['person'], how='left')

# Grouping transaction data by gender and age group, aggregating various statistics
gender_transactions = profile_transactions.groupby(['gender', 'age_group'], observed=True).agg(
    # Transaction count block
    cnt_tran=('amount', 'count'),  # Counting the total number of transactions
    
    # Transaction amount statistics block
    sum_am_tran=('amount', 'sum'),  # Summing the transaction amounts
    mean_am_tran=('amount', 'mean'),  # Calculating the mean transaction amount
    median_am_tran=('amount', 'median'),  # Calculating the median transaction amount
    min_am_tran=('amount', 'min'),  # Finding the minimum transaction amount
    max_am_tran=('amount', 'max'),  # Finding the maximum transaction amount
    
    # Counting small and large transactions based on quantiles
    small_tran_count=('amount', lambda x: (x < x.quantile(0.25)).sum()),  # Count of small transactions (below 25th percentile)
    big_tran_count=('amount', lambda x: (x > x.quantile(0.75)).sum()),  # Count of large transactions (above 75th percentile)
    
    # Calculating the range of transaction amounts
    range_amount_tran=('amount', lambda x: x.max() - x.min()),  # Range between max and min amounts
    
    # Time statistics block
    mean_t_tran=('time', 'mean'),  # Mean time of transactions
    median_t_tran=('time', 'median'),  # Median time of transactions
    min_t_tran=('time', 'min'),  # Minimum transaction time
    max_t_tran=('time', 'max'),  # Maximum transaction time
    
    # Calculating the time range (max time - min time)
    range_t_tran=('time', lambda x: x.max() - x.min()),  
    
    # Frequency of transactions: transactions per unit of time
    freq_tran=('time', lambda x: (len(x) / (x.max() - x.min())) * 100 if x.max() != x.min() else 1),
    
    # Recency of transactions: how recent the transactions are
    recency_tran=('time', lambda x: (714 - x.max() / 714))  # Adjusting based on the most recent time
).round(2).reset_index()  # Round values to 2 decimal places and reset the index

# Creating secondary features related to the transaction amounts
gender_transactions['max_to_sum_am_tran'] = (
    gender_transactions['max_am_tran'] / gender_transactions['sum_am_tran']
)  # Ratio of the maximum transaction amount to the total sum of amounts

gender_transactions['median_to_mean_am_tran'] = (
    gender_transactions['median_am_tran'] / gender_transactions['mean_am_tran']
)  # Ratio of the median transaction amount to the mean amount (indicates distribution skewness)

# Saving the result into a CSV file for further use
gender_transactions.to_csv('medalion_data_store/silver/profile_transactions.csv', index=False)

# Output the final DataFrame containing aggregated transaction data
gender_transactions


Unnamed: 0,gender,age_group,cnt_tran,sum_am_tran,mean_am_tran,median_am_tran,min_am_tran,max_am_tran,small_tran_count,big_tran_count,range_amount_tran,mean_t_tran,median_t_tran,min_t_tran,max_t_tran,range_t_tran,freq_tran,recency_tran,max_to_sum_am_tran,median_to_mean_am_tran
0,F,Young,3054,33221.0,10.88,8.16,0.05,624.47,761,764,624.42,369.59,390.0,0.0,714.0,714.0,427.87,713.0,0.018797,0.75
1,F,Adult,10579,138260.88,13.07,10.62,0.05,943.33,2636,2641,943.28,380.41,408.0,0.0,714.0,714.0,1484.45,713.0,0.006823,0.812548
2,F,Middle,21301,414751.9,19.47,17.25,0.05,1062.28,5322,5322,1062.23,380.43,402.0,0.0,714.0,714.0,2992.16,713.0,0.002561,0.885978
3,F,Senior,14448,277461.22,19.2,16.89,0.05,943.4,3606,3612,943.35,384.21,408.0,0.0,714.0,714.0,2030.81,713.0,0.0034,0.879688
4,M,Young,7274,51182.75,7.04,3.76,0.05,697.13,1817,1816,697.08,383.39,408.0,0.0,714.0,714.0,1019.33,713.0,0.01362,0.534091
5,M,Adult,20983,191437.16,9.12,5.19,0.05,845.01,5230,5235,844.96,379.69,396.0,0.0,714.0,714.0,2944.12,713.0,0.004414,0.569079
6,M,Middle,28459,376392.73,13.23,9.45,0.05,977.78,7102,7112,977.73,381.52,402.0,0.0,714.0,714.0,4000.14,713.0,0.002598,0.714286
7,M,Senior,16078,225878.22,14.05,10.38,0.05,961.21,4016,4020,961.16,384.01,408.0,0.0,714.0,714.0,2258.12,713.0,0.004255,0.73879
8,O,Young,107,1005.47,9.4,8.24,0.05,26.13,27,27,26.08,364.93,372.0,6.0,708.0,702.0,15.24,713.01,0.025988,0.876596
9,O,Adult,461,5047.32,10.95,9.4,0.07,77.44,115,115,77.37,363.93,390.0,0.0,714.0,714.0,64.71,713.0,0.015343,0.858447
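
The `recency_tran` column in both transaction tables stays close to 713 because of operator precedence in `714 - x.max() / 714`. The worked arithmetic below uses the first user's last transaction hour (696) from the User-Transactions table. The two alternative readings are shown only for comparison and are not claimed to be the intended definition:

```python
END_OF_TEST = 714      # last observed hour, as used in the cells above
last_tran_hour = 696   # max_t_tran of the first user in the table

# As written in the notebook: division binds tighter than subtraction.
as_written = END_OF_TEST - last_tran_hour / END_OF_TEST

# Alternative reading 1: hours elapsed since the last transaction.
hours_since_last = END_OF_TEST - last_tran_hour

# Alternative reading 2: the same gap as a fraction of the test window.
normalised = (END_OF_TEST - last_tran_hour) / END_OF_TEST
```

Here `as_written` rounds to 713.03, matching the table, while `hours_since_last` is 18, the value the `recency` column of the User-Transactions-Time table reports for the same user.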


## User-Transactions-Time Table:

Constructing a **Transaction-Time** matrix from the `transactions` table. The time frame is divided into days. The first period contains all days, the second contains all days but the first, and so on. The number of transactions in each period is computed, and a churn column is proposed.

In [57]:
# Aggregate transaction counts per user over time
user_transactions_time = (
    transactions.groupby(['person', 'day'])
    .size()
    .unstack(level=1)
)

# Rename columns to indicate transaction counts over time
user_transactions_time.columns = [
    f'day_{int(col)}' for col in user_transactions_time.columns.to_flat_index()
]

# Compute transaction frequencies over days
frequences_period = pd.DataFrame({
    f'period_{30 - i}': user_transactions_time.iloc[:, i+1:].sum(axis=1)
    for i in range(0, 30)
})

# Define churn as 1 if no transactions occurred in the last ten periods
churn = pd.Series((frequences_period.iloc[:, -10:].sum(axis=1) == 0).astype(int), name='churn')

# Compute recency: Time elapsed since last transaction (714 - max transaction time per person)
recency = (714 - transactions.groupby('person')['time'].max())
recency.name = 'recency'

# Alternative churn definition: 1 if recency is greater than 240 hours (10 days)
churn2 = pd.Series((recency > 240).astype(int), name='churn2')

# Combine all features into a final churn prediction table
user_transactions_time = pd.concat([
        frequences_period, recency, churn, churn2
], axis=1)

user_transactions_time = user_transactions_time.reset_index()

# Save churn data to a CSV file
user_transactions_time.to_csv('medalion_data_store/silver/user_transactions_time.csv', index=False)

# Output the churn table
user_transactions_time


Unnamed: 0,person,period_30,period_29,period_28,period_27,period_26,period_25,period_24,period_23,period_22,...,period_7,period_6,period_5,period_4,period_3,period_2,period_1,recency,churn,churn2
0,0009655768c64bdeb2e877511632db8f,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,...,4.0,3.0,3.0,3.0,3.0,2.0,0.0,18,0,0
1,00116118485d4dfda04fdbaba9a87b5c,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,240,1,0
2,0011e0d4e6b944f998e987f904e8c1e5,5.0,5.0,5.0,5.0,5.0,5.0,4.0,4.0,4.0,...,3.0,2.0,2.0,2.0,1.0,0.0,0.0,60,0,0
3,0020c2b971eb4e9188eac86d93036a77,8.0,8.0,8.0,5.0,5.0,5.0,4.0,4.0,4.0,...,2.0,2.0,2.0,2.0,2.0,2.0,1.0,6,0,0
4,0020ccbbb6d84e358d3414a3ff76cffd,12.0,12.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,...,2.0,2.0,1.0,1.0,1.0,0.0,0.0,42,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16573,fff3ba4757bd42088c044ca26d73817a,11.0,9.0,9.0,8.0,7.0,7.0,7.0,6.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,162,0,0
16574,fff7576017104bcc8677a8d63322b5e1,6.0,6.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0,...,3.0,2.0,1.0,1.0,1.0,1.0,0.0,18,0,0
16575,fff8957ea8b240a6b5e634b6ee8eafcf,5.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,138,0,0
16576,fffad4f4828548d1b5583907f2e9906b,12.0,12.0,11.0,10.0,9.0,8.0,8.0,8.0,8.0,...,3.0,2.0,1.0,1.0,1.0,1.0,0.0,36,0,0


In [58]:
user_transactions_time[user_transactions_time['churn'] == 1]

Unnamed: 0,person,period_30,period_29,period_28,period_27,period_26,period_25,period_24,period_23,period_22,...,period_7,period_6,period_5,period_4,period_3,period_2,period_1,recency,churn,churn2
1,00116118485d4dfda04fdbaba9a87b5c,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,240,1,0
11,0063def0f9c14bc4805322a488839b32,3.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,324,1,1
17,008d7088107b468893889da0ede0df5c,7.0,7.0,7.0,7.0,7.0,5.0,4.0,4.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,264,1,1
33,00bc983061d3471e8c8e74d31b7c8b6f,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,318,1,1
36,00c32a104f0c4065b5b552895fb22e34,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,330,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16525,ff23416f1f724ece8b81bae7acb279ed,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,276,1,1
16529,ff39b67d25cf48dd8fae9c8766f5eeed,4.0,4.0,4.0,3.0,3.0,2.0,2.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,372,1,1
16534,ff4c63a0550743c1a1a46d0f7b792186,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,240,1,0
16544,ff7cb44e72db4112b270560686f97a23,8.0,8.0,6.0,5.0,4.0,3.0,3.0,3.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,252,1,1
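
The period columns are trailing sums over the daily counts, and churn checks whether the most recent periods are empty. The sketch below shows the idea on a made-up four-day matrix. The slice offset and the two-period churn window are assumptions chosen to keep the toy small, and they differ from the cell above:

```python
import pandas as pd

# Toy daily transaction counts for two customers over four days.
daily = pd.DataFrame(
    {"day_0": [1, 0], "day_1": [0, 2], "day_2": [1, 0], "day_3": [0, 0]},
    index=pd.Index(["a", "b"], name="person"),
)
n_days = daily.shape[1]

# period_k holds the number of transactions in the last k days.
periods = pd.DataFrame({
    f"period_{n_days - i}": daily.iloc[:, i:].sum(axis=1)
    for i in range(n_days)
})

# Churn flag for this toy: no transactions in the last two days.
churn = (periods["period_2"] == 0).astype(int).rename("churn")

churn_table = pd.concat([periods, churn], axis=1).reset_index()
```

Because each period is contained in the previous one, checking a single period column is enough to know that every shorter period is empty as well.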


# Analytical tables 

A final table merging selected tables created in the previous steps.

The goal is to get different types of datasets to be used in the analysis, modeling, and recommendations.

## User-Event-transactions

 - merging `user_transactions` onto `person_event_time` so that each offer-level row also carries the customer's standard transaction behavior.

In [59]:
person_event_time.shape

(66501, 19)

In [60]:
user_transactions.shape

(16578, 22)
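
The two shapes above suggest a many-to-one join: several offer-level rows per person on the left, one transaction summary per person on the right. As a minimal sketch on made-up frames (`events_toy` and `users_toy` are assumptions), `validate` and `indicator` in `DataFrame.merge` can confirm that expectation and count the rows without a transaction summary:

```python
import pandas as pd

# Left side: several offer rows per person; "c" has no transactions.
events_toy = pd.DataFrame({
    "person": ["a", "a", "b", "c"],
    "ofr_id_short": ["ofr_C", "ofr_G", "ofr_I", "ofr_J"],
})

# Right side: exactly one summary row per person.
users_toy = pd.DataFrame({"person": ["a", "b"], "cnt_tran": [8, 3]})

# validate raises MergeError if the right key is not unique;
# indicator adds a _merge column showing where each row came from.
merged = events_toy.merge(
    users_toy, on="person", how="left",
    validate="many_to_one", indicator=True,
)

# Rows kept by the left join that found no transaction summary.
unmatched = int((merged["_merge"] == "left_only").sum())
```

A left join keeps the row count of the left table, so the unmatched rows show up as `NaN` in the transaction columns rather than being dropped.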

In [61]:
# Use a left join to keep all records from person_event_time

user_event_transactions = person_event_time.merge(user_transactions, on=['person'], how='left')

# Drop the columns that contain inf values, plus the helper columns
user_event_transactions = user_event_transactions.drop(columns=['t_completed', 't_received', 't_viewed', 'stage', 'to_vr', 'to_cv', 'to_cr'])

# Save the final dataset to a CSV file
user_event_transactions.to_csv('medalion_data_store/gold/user_event_transactions.csv', index=False)


user_event_transactions

Unnamed: 0,person,gender,age_group,ofr_id_short,cnt_completed,cnt_received,cnt_viewed,curiosity_vr,eagerness_cv,overall_cr,...,min_t_tran,max_t_tran,range_t_tran,freq_tran,recency_tran,max_to_sum_am_tran,median_to_mean_am_tran,weekly_tran_mean,weekly_tran_min,weekly_tran_max
0,0009655768c64bdeb2e877511632db8f,M,Adult,ofr_C,0.0,1.0,1.0,0.106,0.000,0.000,...,228.0,696.0,468.0,1.71,713.03,0.220690,0.867712,2.00,1.0,4.0
1,0009655768c64bdeb2e877511632db8f,M,Adult,ofr_G,1.0,1.0,1.0,0.106,1.446,0.256,...,228.0,696.0,468.0,1.71,713.03,0.220690,0.867712,2.00,1.0,4.0
2,0009655768c64bdeb2e877511632db8f,M,Adult,ofr_H,0.0,1.0,1.0,0.256,0.000,0.000,...,228.0,696.0,468.0,1.71,713.03,0.220690,0.867712,2.00,1.0,4.0
3,0009655768c64bdeb2e877511632db8f,M,Adult,ofr_I,1.0,1.0,1.0,0.042,1.933,0.765,...,228.0,696.0,468.0,1.71,713.03,0.220690,0.867712,2.00,1.0,4.0
4,0009655768c64bdeb2e877511632db8f,M,Adult,ofr_J,1.0,1.0,0.0,0.000,2.000,1.000,...,228.0,696.0,468.0,1.71,713.03,0.220690,0.867712,2.00,1.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66496,ffff82501cea40309d5fdd7edcca4a07,F,Adult,ofr_E,1.0,1.0,1.0,0.765,0.256,0.166,...,60.0,648.0,588.0,2.55,713.09,0.103154,1.033179,3.75,3.0,5.0
66497,ffff82501cea40309d5fdd7edcca4a07,F,Adult,ofr_G,1.0,1.0,1.0,0.765,0.026,0.016,...,60.0,648.0,588.0,2.55,713.09,0.103154,1.033179,3.75,3.0,5.0
66498,ffff82501cea40309d5fdd7edcca4a07,F,Adult,ofr_J,1.0,1.0,1.0,0.383,0.166,0.042,...,60.0,648.0,588.0,2.55,713.09,0.103154,1.033179,3.75,3.0,5.0
66499,ffff82501cea40309d5fdd7edcca4a07,F,Adult,ofr_J,1.0,1.0,1.0,0.765,1.000,0.765,...,60.0,648.0,588.0,2.55,713.09,0.103154,1.033179,3.75,3.0,5.0
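
The cell above drops the columns that contain inf values by name. As a hedged sketch, one way to find such columns programmatically, and an alternative that keeps them, could look like this. The frame `df` and its values are hypothetical; only the column name `to_cv` is borrowed from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the ratio column mimics a division by zero.
df = pd.DataFrame({
    "person": ["a", "b", "c"],
    "cnt_viewed": [2.0, 0.0, 1.0],
    "to_cv": [0.5, np.inf, 1.0],
})

numeric = df.select_dtypes(include="number")
# Names of numeric columns holding at least one +inf or -inf.
inf_cols = [col for col in numeric.columns if np.isinf(numeric[col]).any()]

# Option A (the approach taken in the notebook): drop those columns.
dropped = df.drop(columns=inf_cols)
# Option B: keep the columns and mark the infinite entries as missing.
cleaned = df.replace([np.inf, -np.inf], np.nan)
```

Option B preserves the finite ratios for later use, at the cost of introducing missing values that a downstream model would have to handle.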


In [62]:
user_event_transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66501 entries, 0 to 66500
Data columns (total 33 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   person                  66501 non-null  object  
 1   gender                  66501 non-null  object  
 2   age_group               66501 non-null  category
 3   ofr_id_short            66501 non-null  object  
 4   cnt_completed           66501 non-null  float64 
 5   cnt_received            66501 non-null  float64 
 6   cnt_viewed              66501 non-null  float64 
 7   curiosity_vr            66501 non-null  float64 
 8   eagerness_cv            66501 non-null  float64 
 9   overall_cr              66501 non-null  float64 
 10  influence               66501 non-null  int64   
 11  ext_influence           66501 non-null  int64   
 12  cnt_tran                65020 non-null  float64 
 13  sum_am_tran             65020 non-null  float64 
 14  mean_am_tran          

## User-Event-U-transactions

 - Merge `transactions` onto `person_event_time` where `t_completed` in `person_event_time` equals the transaction `time`, so that each completed offer carries its completing transaction in the same dataset.
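
The exact-time match described above can be sketched on toy data as follows. The frames `events` and `tx` are stand-ins with invented values; only the column names follow the notebook. The row-count check at the end is an added suggestion, not part of the original cell.

```python
import pandas as pd

# Toy stand-ins; column names follow the notebook, values are invented.
events = pd.DataFrame({
    "person": ["a", "a", "b"],
    "ofr_id_short": ["ofr_A", "ofr_B", "ofr_A"],
    "t_completed": [24.0, float("nan"), 60.0],  # NaN = offer never completed
})
tx = pd.DataFrame({
    "person": ["a", "a", "b"],
    "time": [24, 30, 60],          # integer hours since the start of the test
    "amount": [10.5, 3.2, 7.0],
})

# Align the key dtypes before merging on them.
tx["time"] = tx["time"].astype(float)

merged = events.merge(
    tx,
    left_on=["person", "t_completed"],
    right_on=["person", "time"],
    how="left",
)

# The row count is preserved only if no person has two transactions
# recorded at the same hour; otherwise the matching row is duplicated.
same_length = len(merged) == len(events)
```

Offers that were never completed find no matching transaction, so their `time` and `amount` stay missing, which matches the empty cells in the output table of this section.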

In [63]:
transactions.shape

(138953, 7)

In [64]:
person_event_time.shape

(66501, 19)

In [65]:
# Use a left join to keep all records from person_event_time

transactions['time'] = transactions['time'].astype(float) # avoid an int-to-float comparison during the merge process

user_event_u_transactions = person_event_time.merge(transactions, left_on=['person', 't_completed'], right_on=['person', 'time'], how='left')

# Drop the columns that contain inf values and the tag helper column
user_event_u_transactions = user_event_u_transactions.drop(columns=['tag', 't_completed', 't_received', 't_viewed', 'stage', 'to_vr', 'to_cv', 'to_cr'])

# Save the final dataset to a CSV file
user_event_u_transactions.to_csv('medalion_data_store/gold/user_event_u_transactions.csv', index=False)


user_event_u_transactions

Unnamed: 0,person,gender,age_group,ofr_id_short,cnt_completed,cnt_received,cnt_viewed,curiosity_vr,eagerness_cv,overall_cr,influence,ext_influence,time,amount,time_diff,week,day
0,0009655768c64bdeb2e877511632db8f,M,Adult,ofr_C,0.0,1.0,1.0,0.106,0.000,0.000,0,0,,,,,
1,0009655768c64bdeb2e877511632db8f,M,Adult,ofr_G,1.0,1.0,1.0,0.106,1.446,0.256,0,0,528.0,14.11,114.0,4.0,22.0
2,0009655768c64bdeb2e877511632db8f,M,Adult,ofr_H,0.0,1.0,1.0,0.256,0.000,0.000,0,0,,,,,
3,0009655768c64bdeb2e877511632db8f,M,Adult,ofr_I,1.0,1.0,1.0,0.042,1.933,0.765,0,0,414.0,8.57,186.0,3.0,18.0
4,0009655768c64bdeb2e877511632db8f,M,Adult,ofr_J,1.0,1.0,0.0,0.000,2.000,1.000,0,0,576.0,10.27,24.0,4.0,24.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66496,ffff82501cea40309d5fdd7edcca4a07,F,Adult,ofr_E,1.0,1.0,1.0,0.765,0.256,0.166,1,0,198.0,22.88,78.0,2.0,9.0
66497,ffff82501cea40309d5fdd7edcca4a07,F,Adult,ofr_G,1.0,1.0,1.0,0.765,0.026,0.016,1,0,60.0,16.06,,1.0,3.0
66498,ffff82501cea40309d5fdd7edcca4a07,F,Adult,ofr_J,1.0,1.0,1.0,0.383,0.166,0.042,1,0,384.0,15.57,72.0,3.0,16.0
66499,ffff82501cea40309d5fdd7edcca4a07,F,Adult,ofr_J,1.0,1.0,1.0,0.765,1.000,0.765,0,1,414.0,17.55,30.0,3.0,18.0


## Gender-Event-Transactions Table

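 - Merge `gender_transactions` onto `gender_event_time` by `gender` and `age_group`, then add the offer-level statistics from `offer_event_time` by `ofr_id_short`.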

In [66]:
gender_event_transaction = gender_event_time.merge(gender_transactions, on=['gender', 'age_group'], how='left')

gender_event_transaction = gender_event_transaction.merge(offer_event_time, on=['ofr_id_short'], how='left')

gender_event_transaction.to_csv('medalion_data_store/gold/gender_event_transactions.csv', index=False)

gender_event_transaction

Unnamed: 0,gender,age_group,ofr_id_short,cnt_completed_x,cnt_received_x,cnt_viewed_x,curiosity_vr,eagerness_cv,overall_cr,influence,...,difficulty,cnt_completed_y,cnt_received_y,cnt_viewed_y,time_mean_completed,time_mean_received,time_mean_viewed,comp_ratio,view_ratio,comp-vie-intensity
0,F,Young,ofr_A,67.0,121.0,115.0,0.424,0.046,0.013,1,...,10,3688.0,7658.0,6716.0,394.767,329.760,352.622,0.482,0.877,0.549
1,F,Young,ofr_B,62.0,135.0,130.0,0.300,0.372,0.077,1,...,10,3331.0,7593.0,7298.0,385.722,335.153,353.119,0.439,0.961,0.456
2,F,Young,ofr_C,0.0,126.0,41.0,0.053,0.000,0.000,0,...,0,0.0,7617.0,4144.0,,331.885,358.639,,0.544,
3,F,Young,ofr_D,78.0,121.0,53.0,0.150,0.249,0.023,1,...,5,4354.0,7677.0,4171.0,407.051,334.146,361.977,0.567,0.543,1.044
4,F,Young,ofr_E,64.0,126.0,23.0,1.843,0.000,0.001,1,...,20,3420.0,7668.0,2663.0,431.549,331.336,366.748,0.446,0.347,1.284
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,O,Senior,ofr_F,18.0,25.0,24.0,0.905,0.044,0.037,1,...,7,5156.0,7646.0,7337.0,400.318,336.377,354.750,0.674,0.960,0.703
116,O,Senior,ofr_G,15.0,23.0,23.0,0.596,0.127,0.056,1,...,10,5317.0,7597.0,7327.0,399.117,330.487,348.868,0.700,0.964,0.726
117,O,Senior,ofr_H,0.0,26.0,24.0,1.077,0.000,0.000,0,...,0,0.0,7618.0,6687.0,,332.475,353.934,,0.878,
118,O,Senior,ofr_I,18.0,22.0,20.0,0.084,1.578,0.281,0,...,5,4296.0,7571.0,7264.0,382.936,332.171,349.797,0.567,0.959,0.591
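
The `_x` and `_y` column names in the table above are the default suffixes that pandas adds when both sides of a merge share a non-key column name. A small sketch of how explicit suffixes could make the two levels easier to tell apart is shown below. The suffix names `_segment` and `_offer` are hypothetical, the toy frames are invented, and the assumption that the `_y` columns come from `offer_event_time` is inferred from the output rather than confirmed.

```python
import pandas as pd

# Toy segment-level and offer-level tables that share a column name.
segment = pd.DataFrame({
    "gender": ["F", "M"],
    "age_group": ["Young", "Young"],
    "ofr_id_short": ["ofr_A", "ofr_A"],
    "cnt_completed": [67.0, 80.0],
})
offer = pd.DataFrame({
    "ofr_id_short": ["ofr_A"],
    "cnt_completed": [3688.0],
    "difficulty": [10],
})

merged = segment.merge(
    offer,
    on="ofr_id_short",
    how="left",
    suffixes=("_segment", "_offer"),  # hypothetical names replacing _x / _y
)
cols = merged.columns.tolist()
```

With descriptive suffixes, a reader of the saved CSV can see at a glance which count refers to the gender and age segment and which refers to the offer as a whole.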
