# Labeling
## Creating the ground truth

With our cleaned data, we can determine which received offers were viewed and completed.

### Import the Python libraries

In [2]:
import pandas as pd
import numpy as np
import math
import json
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'



### Read the cleaned data files

In [3]:
full_df = pd.read_csv('data/full.csv', low_memory=False)

In [77]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306137 entries, 0 to 306136
Data columns (total 33 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   gender             306137 non-null  object 
 1   age                306137 non-null  float64
 2   person_id          306137 non-null  object 
 3   became_member_on   306137 non-null  object 
 4   income             306137 non-null  float64
 5   year               306137 non-null  int64  
 6   month              306137 non-null  int64  
 7   day                306137 non-null  int64  
 8   member_since_days  306137 non-null  int64  
 9   F                  306137 non-null  int64  
 10  M                  306137 non-null  int64  
 11  O                  306137 non-null  int64  
 12  U                  306137 non-null  int64  
 13  time               306137 non-null  int64  
 14  completed          306137 non-null  int64  
 15  received           306137 non-null  int64  
 16  vi

In [88]:
columns = ['person_id', 'ticks', 'time', 'offer_type', 'received', 'viewed', 'completed' ]

full_df.columns.name='index'
full_df.query('person_id == "68be06ca386d4c31939f3a4f0e3dd783"').loc[[168338, 168339, 168340], columns]


index,person_id,ticks,time,offer_type,received,viewed,completed
168338,68be06ca386d4c31939f3a4f0e3dd783,discount_2_10_10,408,discount,1,0,0
168339,68be06ca386d4c31939f3a4f0e3dd783,discount_2_10_10,408,discount,0,1,0
168340,68be06ca386d4c31939f3a4f0e3dd783,discount_2_10_10,552,discount,0,0,1


### How many offers are provided for customers?


How many different offers did each customer receive? There are ten different offers but only three offer types ("bogo", "discount" and "informational"), so I want to know how many different types each customer received.


### Create sub dataframes for the labeling process

To find the viewed and completed status for each received offer, I have to create several sub dataframes.

**received**: All rows from full dataframe with column received == 1. This is the baseline dataframe that I will update with the values for viewed and completed.

**viewed**: All rows from full dataframe with column viewed == 1.

**completed**: All rows from full dataframe with column completed == 1.


**transaction**: All rows from full dataframe with column transaction == 1.

**offer_received**: Based on received dataframe but without informational offers.

**advert_received**: Based on received dataframe but only informational offers.

**offer_viewed**: Based on viewed dataframe but without informational offers.

**offer_completed**: Based on completed dataframe but without informational offers.

**advert_viewed**: Based on viewed dataframe but only informational offers.



In [33]:
# create a sub dataframe only with offer received events
received = full_df[full_df['received'] == 1]
# create a sub dataframe only with offer viewed events
viewed = full_df[full_df['viewed'] == 1]
# create a sub dataframe only with offer completed events
completed = full_df[full_df['completed'] == 1]
# create a sub dataframe only with transaction events
transaction = full_df[full_df['transaction'] == 1]
# create a subreceived dataframe only with offer_type != Informational
offer_received = received[received.offer_type != 'informational']
# create a subreceived dataframe only with offer_type == Informational
advert_received = received[received.offer_type == 'informational']
# create a subviewed dataframe only with offer_type != Informational
offer_viewed = viewed[viewed.offer_type != 'informational']
# create a subviewed dataframe only with offer_type == Informational
advert_viewed = viewed[viewed.offer_type == 'informational']
# create a subcompleted dataframe only with offer_type != Informational
offer_completed = completed[completed.offer_type != 'informational']


In [34]:
# Group the dataframe by offer_type and aggregate the size per group
offer_count = received.groupby(['person_id', 'offer_type']).size().reset_index()
# Use the pandas pivot function to create a dataframe with index=persons, columns=offer_types and the group sizes as values
offer_count = offer_count.pivot(index='person_id', columns='offer_type', values=0)
# Create a column unique_offers with count in axis=1 for type columns
offer_count['unique_offers'] = offer_count.iloc[:,:3].count(axis=1).values
# Create a column total_offers with sum in axis=1 for type columns
offer_count['total_offers'] = offer_count.iloc[:,:3].sum(axis=1).values

offer_count

person_id,bogo,discount,informational,unique_offers,total_offers
0009655768c64bdeb2e877511632db8f,1.0,2.0,2.0,3,5.0
00116118485d4dfda04fdbaba9a87b5c,2.0,,,1,2.0
0011e0d4e6b944f998e987f904e8c1e5,1.0,2.0,2.0,3,5.0
0020c2b971eb4e9188eac86d93036a77,2.0,2.0,1.0,3,5.0
0020ccbbb6d84e358d3414a3ff76cffd,2.0,1.0,1.0,3,4.0
...,...,...,...,...,...
fff3ba4757bd42088c044ca26d73817a,1.0,3.0,2.0,3,6.0
fff7576017104bcc8677a8d63322b5e1,3.0,2.0,,2,5.0
fff8957ea8b240a6b5e634b6ee8eafcf,1.0,1.0,1.0,3,3.0
fffad4f4828548d1b5583907f2e9906b,3.0,,1.0,2,4.0
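
The same per-customer counts can also be sketched with `pd.crosstab`. This is only an illustrative alternative on made-up toy data (`toy_received`, `toy_count` and `summary` are hypothetical names), not the code used for the table above. Note that `crosstab` fills missing combinations with 0 instead of NaN, so the number of distinct types is counted with `> 0`.

```python
import pandas as pd

# Toy stand-in for the `received` dataframe (values are made up for illustration)
toy_received = pd.DataFrame({
    'person_id':  ['a', 'a', 'a', 'b', 'b', 'c'],
    'offer_type': ['bogo', 'discount', 'discount', 'bogo', 'bogo', 'informational'],
})

# Rows = customers, columns = offer types, cells = number of received offers
toy_count = pd.crosstab(toy_received['person_id'], toy_received['offer_type'])

# Number of distinct offer types and total number of offers per customer
summary = pd.DataFrame({
    'unique_offers': (toy_count > 0).sum(axis=1),
    'total_offers': toy_count.sum(axis=1),
})
print(summary)
```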


### Working process

We have 306137 transcript rows. 167184 rows belong to offer events, while the remaining 138953 rows are regular transactions.
The offer events are received, viewed and completed, except for informational offers, which have no completed event. Instead, an informational offer counts as completed when a transaction is performed within its duration.
Another problem is that every offer has a "received" event but not necessarily a "viewed" or "completed" one. A preprocessing step is therefore needed that adds the "viewed" and "completed" information to each "received" offer.
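
A quick way to verify such a split is to sum the event flag columns. The sketch below uses a made-up toy frame (`toy_full` is hypothetical) and assumes that every row carries exactly one of the four event flags, as in the sample rows shown earlier.

```python
import pandas as pd

# Toy stand-in for full_df with one event flag per row (made-up values)
toy_full = pd.DataFrame({
    'received':    [1, 0, 0, 0, 1],
    'viewed':      [0, 1, 0, 0, 0],
    'completed':   [0, 0, 0, 1, 0],
    'transaction': [0, 0, 1, 0, 0],
})

event_cols = ['received', 'viewed', 'completed', 'transaction']

# Number of rows per event type
counts = toy_full[event_cols].sum()

# Assumption check: each row belongs to exactly one event type
one_event_per_row = (toy_full[event_cols].sum(axis=1) == 1).all()

# Rows that belong to offer events = all rows minus regular transactions
offer_rows = len(toy_full) - counts['transaction']
print(counts, one_event_per_row, offer_rows)
```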



### Method to extract viewed and completed information for each received offer

The method takes the received dataframe and a corresponding match dataframe, i.e. the viewed or the completed dataframe.
* The method loops over all unique customers from received dataframe.

```python
for i, person in enumerate(tqdm(received_df.person_id.unique())):
```
* For each customer, a received and a match sub dataframe are created.

```python
sub_received = received_df[received_df.person_id == person]
sub_match = match_df[match_df.person_id == person]
```
* Loop over the events in the customer's received sub dataframe.

```python
for j, (time, offer_id, validity) in enumerate(zip(sub_received.time, sub_received.offer_id, sub_received.validity)):
```
* Try to find a match in the customer's match sub dataframe: same offer_id and a time inside the validity window.

```python
match = sub_match.query('offer_id == @offer_id and @time <= time <= @validity')
```
* Depending on whether a match is found, add the index to the in_idx or the out_idx list.

```python
if len(match) == 0:
    out_idx.append(sub_received.iloc[j].name)
else:
    in_idx.append((sub_received.iloc[j].name))
```
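
The steps above can be tried end to end on a tiny example. This is a minimal sketch with made-up data (`toy_received` and `toy_viewed` are hypothetical), assuming `validity` holds the last point in time at which an offer is still valid. It uses boolean masks instead of `query`, but follows the same logic.

```python
import pandas as pd

# Made-up received offers; the index plays the role of the global row index
toy_received = pd.DataFrame(
    {'person_id': ['a', 'a', 'b'],
     'offer_id': ['o1', 'o2', 'o1'],
     'time': [0, 168, 0],
     'validity': [120, 288, 120]},
    index=[10, 11, 12])

# Made-up viewed events
toy_viewed = pd.DataFrame(
    {'person_id': ['a', 'b'],
     'offer_id': ['o1', 'o1'],
     'time': [6, 200]},
    index=[20, 21])

in_idx, out_idx = [], []
for person in toy_received['person_id'].unique():
    sub_received = toy_received[toy_received['person_id'] == person]
    sub_match = toy_viewed[toy_viewed['person_id'] == person]
    for row in sub_received.itertuples():
        # Same offer_id and event time inside [time, validity]
        hit = sub_match[(sub_match['offer_id'] == row.offer_id)
                        & sub_match['time'].between(row.time, row.validity)]
        if len(hit) == 0:
            out_idx.append(row.Index)
        else:
            in_idx.append(row.Index)

print(in_idx, out_idx)
```

Offer 10 is viewed inside its window, offer 11 is never viewed, and offer 12 is viewed only after it has expired.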



In [35]:
from tqdm import tqdm


def find_matches(received_df, match_df, transaction=False):
    '''Search the match DataFrame for events that correspond to each received offer.

    A corresponding event belongs to the same person, has the same offer_id and
    a time inside the validity window of the received offer. With transaction=True
    the offer_id is not compared, because transactions carry no offer_id.

    Returns out_idx, in_idx, out_df and match_length (plus in_amount if
    transaction=True).'''

    out_idx = []       # index labels of received offers without a match
    in_idx = []        # index labels of received offers with a match
    match_length = []  # number of matching events per matched offer
    matches = []       # matching events, concatenated into out_df at the end
    in_amount = []     # amount value of each matched received row

    for person in tqdm(received_df.person_id.unique()):

        # Sub DataFrame of received offers for the current person
        sub_received = received_df[received_df.person_id == person]

        # Sub DataFrame of candidate events (viewed, completed or transaction) for the current person
        sub_match = match_df[match_df.person_id == person]

        # Loop over the received offers: iloc position, time, offer_id, validity
        for j, (time, offer_id, validity) in enumerate(zip(sub_received.time, sub_received.offer_id, sub_received.validity)):

            if transaction:
                # Transactions have no offer_id, so match on the time window only
                match = sub_match.query('@time <= time <= @validity')
                match.loc[:, 'offer_id'] = offer_id
                match['offer_received'] = '1'
                match['offer_viewed'] = '1'
                match['offer_completed'] = str(len(match))
            else:
                # Match on the same offer_id and a time inside the validity window
                match = sub_match.query('offer_id == @offer_id and @time <= time <= @validity')

            if len(match) == 0:
                # No match: remember the global index label of the received offer
                out_idx.append(sub_received.iloc[j].name)
            else:
                in_idx.append(sub_received.iloc[j].name)
                in_amount.append(sub_received.iloc[j].amount)
                # Record whether there was more than one match
                match_length.append(len(match))
                matches.append(match)

    # Build the DataFrame of all matched events
    out_df = pd.concat(matches) if matches else pd.DataFrame()
    match_length = pd.DataFrame({'length': match_length})

    if transaction:
        return out_idx, in_idx, out_df, match_length, in_amount
    else:
        return out_idx, in_idx, out_df, match_length
    




### Extract offer viewed information

Use the offer_received and offer_viewed sub dataframes as input.

In [36]:
# Check for offer viewed matches from viewed dataframe
offer_not_viewed_idx, offer_viewed_idx, offer_viewed_build, offer_viewed_match_length = \
find_matches(offer_received, offer_viewed)

100%|██████████| 16928/16928 [15:44<00:00, 17.92it/s]
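
The per-customer loop is slow, as the progress bar shows. A merge-based variant might be considerably faster. The following is only a sketch under assumptions: `find_matches_merge` is a hypothetical helper that is not part of this notebook, it covers only the non-transaction case, and it returns just the two index lists (in sorted order rather than in customer order).

```python
import pandas as pd

def find_matches_merge(received_df, match_df):
    """Hypothetical vectorized variant: returns (out_idx, in_idx) only."""
    left = received_df[['person_id', 'offer_id', 'time', 'validity']].copy()
    left['received_idx'] = left.index  # keep the global index label

    right = match_df[['person_id', 'offer_id', 'time']].rename(
        columns={'time': 'match_time'})

    # Pair every received offer with all events of the same person and offer
    pairs = left.merge(right, on=['person_id', 'offer_id'], how='inner')

    # Keep only pairs whose event time lies inside the validity window
    in_window = pairs[pairs['match_time'].between(pairs['time'], pairs['validity'])]

    in_idx = sorted(in_window['received_idx'].unique().tolist())
    out_idx = sorted(set(received_df.index) - set(in_idx))
    return out_idx, in_idx

# Made-up data for illustration
toy_received = pd.DataFrame(
    {'person_id': ['a', 'a', 'b'], 'offer_id': ['o1', 'o2', 'o1'],
     'time': [0, 168, 0], 'validity': [120, 288, 120]},
    index=[10, 11, 12])
toy_viewed = pd.DataFrame(
    {'person_id': ['a', 'b'], 'offer_id': ['o1', 'o1'], 'time': [6, 200]})

toy_out_idx, toy_in_idx = find_matches_merge(toy_received, toy_viewed)
print(toy_out_idx, toy_in_idx)
```

Since the labels are written with `.loc`, the order of the index lists does not matter for the labeling step.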


Use the `in_idx` and `out_idx` lists to fill the viewed column with the label 1 for matched offers and -1 for offers where no viewed event could be found.

In [37]:
# Add -1 to event_offer viewed column where no match was located
received.loc[offer_not_viewed_idx, 'viewed'] = -1
# Add 1 to event_offer viewed columns where match was located
received.loc[offer_viewed_idx, 'viewed'] = 1


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)
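
The warning above appears because `received` was created by filtering `full_df`, so pandas cannot tell whether the assignment is meant for the subset or for the original. A common way to make the intent explicit is to take a copy when creating the subset. The sketch below uses a made-up frame (`toy_full` and `toy_received` are hypothetical names) and shows that the labels land in the copy while the original stays untouched.

```python
import pandas as pd

# Made-up stand-in for full_df
toy_full = pd.DataFrame({'received': [1, 0, 1], 'viewed': [0, 1, 0]})

# An explicit copy makes the subset an independent dataframe
toy_received = toy_full[toy_full['received'] == 1].copy()

# Label by index, as in the cell above
toy_received.loc[[0], 'viewed'] = -1
toy_received.loc[[2], 'viewed'] = 1

print(toy_received)
print(toy_full)  # unchanged
```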


### Extract offer completed information

Use the offer_received and offer_completed sub dataframes as input.

In [38]:
# Check for offer completed matches from completed dataframe
offer_not_completed_idx, offer_completed_idx, offer_completed_build, offer_completed_match_length = \
find_matches(offer_received, offer_completed)

100%|██████████| 16928/16928 [12:04<00:00, 23.38it/s]


Use the `in_idx` and `out_idx` lists to fill the completed column with the label 1 for matched offers and -1 for offers where no completed event could be found.

In [39]:
# Set completed to -1 where no match was located
received.loc[offer_not_completed_idx, 'completed'] = -1
# Set completed to 1 where a match was located
received.loc[offer_completed_idx, 'completed'] = 1



### Extract advert viewed information
Use the advert_received and advert_viewed sub dataframes as input.

In [43]:
# Check for advert viewed matches from the advert_viewed dataframe
advert_not_viewed_idx, advert_viewed_idx, advert_viewed_build, advert_viewed_match_length = \
find_matches(advert_received, advert_viewed)

100%|██████████| 10547/10547 [03:03<00:00, 57.53it/s]


Use the `in_idx` and `out_idx` lists to fill the viewed column with the label 1 for matched adverts and -1 for adverts where no viewed event could be found.

In [44]:
# Add -1 to event_offer viewed column where no match was located
received.loc[advert_not_viewed_idx, 'viewed'] = -1
# Add 1 to event_offer viewed columns where match was located
received.loc[advert_viewed_idx, 'viewed'] = 1




### Extract advert completed information

Use the advert_received and transaction sub dataframes as input. Informational offers have no completed events that could serve as matches, so I have to derive the completed status from the regular transactions. In this case the method gets the additional argument `transaction=True`.

The offer_id from advert_received is not checked against the transactions, because transactions have no offer_id.
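
As a minimal illustration of this time-window-only matching, consider the following made-up data (`toy_advert` and `toy_transaction` are hypothetical, and `validity` is assumed to mark the end of the advert's duration):

```python
import pandas as pd

# Made-up informational offers with their validity window
toy_advert = pd.DataFrame(
    {'person_id': ['a', 'b'], 'time': [0, 0], 'validity': [72, 72]},
    index=[1, 2])

# Made-up regular transactions (no offer_id available)
toy_transaction = pd.DataFrame(
    {'person_id': ['a', 'a', 'b'],
     'time': [24, 48, 100],
     'amount': [5.0, 7.5, 3.0]})

completed_label = {}
for row in toy_advert.itertuples():
    sub = toy_transaction[toy_transaction['person_id'] == row.person_id]
    # Only the time window is checked, not the offer_id
    hits = sub[sub['time'].between(row.time, row.validity)]
    completed_label[row.Index] = 1 if len(hits) > 0 else -1

print(completed_label)
```

Customer `a` makes two transactions inside the window, so the advert counts as completed. The only transaction of customer `b` happens after the window has closed.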


In [45]:
# Check for advert completed matches from the transaction dataframe
advert_not_completed_idx, advert_completed_idx, advert_completed_build, advert_completed_match_length, amount = \
find_matches(advert_received, transaction, transaction=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
100%|██████████| 10547/10547 [04:40<00:00, 37.61it/s]


Use the `in_idx` and `out_idx` lists to fill the completed column with the label 1 for matched adverts and -1 for adverts where no transaction could be found within the validity window.

In [46]:
# Set completed to -1 where no match was located
received.loc[advert_not_completed_idx, 'completed'] = -1
# Set completed to 1 where a match was located
received.loc[advert_completed_idx, 'completed'] = 1


### Check the received dataframe
Every received offer should now have a label for viewed and completed. Let's check whether any received offer is still missing a label.

In [47]:
print('Offer viewed')
print(received['viewed'].value_counts())
print('Offer completed')
print(received['completed'].value_counts())

Offer viewed
 1    56895
-1    19382
Name: viewed, dtype: int64
Offer completed
 1    43114
-1    33163
Name: completed, dtype: int64
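
From the counts above, about 74.6 % of the received offers were viewed (56895 of 76277) and about 56.5 % were completed (43114 of 76277). A compact programmatic version of this check could look like the sketch below. It runs on a made-up frame (`toy_received` is hypothetical) and assumes that 1 and -1 are the only valid labels.

```python
import pandas as pd

# Made-up labels for illustration
toy_received = pd.DataFrame({'viewed': [1, -1, 1, 1],
                             'completed': [-1, -1, 1, 1]})

label_cols = ['viewed', 'completed']

# True only if every cell holds one of the two valid labels
all_labelled = bool(toy_received[label_cols].isin([1, -1]).all().all())

# Share of offers labelled 1 per column
positive_share = (toy_received[label_cols] == 1).mean()

print(all_labelled)
print(positive_share)
```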


### Write the received dataframe to a csv file

The viewed and completed columns of the received dataframe contain no null values. Every received offer now carries the information whether it was viewed and whether it was completed.

In [48]:
# Write csv file for received dataframe
received.to_csv('data/received.csv', index=True)
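
Because the file is written with `index=True`, the original row index is stored as the first column. When reading the file back, `index_col=0` restores it. The round trip is sketched here with an in-memory buffer and made-up values instead of the real file:

```python
import io
import pandas as pd

# Made-up labelled rows with a non-trivial index
toy_received = pd.DataFrame({'viewed': [1, -1], 'completed': [-1, 1]},
                            index=[168338, 168341])

# Write to an in-memory buffer instead of 'data/received.csv'
buffer = io.StringIO()
toy_received.to_csv(buffer, index=True)
buffer.seek(0)

# index_col=0 turns the first column back into the row index
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```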

### Continue with the exploratory data analysis

In [None]:
received.info()