<a href="https://colab.research.google.com/github/Ilvecho/Project_7/blob/main/Project_7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The data available for this project are **real data** so they are dirty.

A lot of values are missing, so we need to think how to handle them.

Normally, several hypothesis need to be tested, but in this case we are short in time, so we would make just one iteration of reasonable assumptions.

Furthermore, the data contains the **date** of each step, as well as the date of first and last contact.

We will leverage these pieces of information mainly to fill out some missing values. However, for the model itself we will just consider the binary state (yes - no) rather than the dates. This choice is driven by the need to prioritize simplicity for time constraints.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import os
import pickle
import re
import random

import torch
from torch import nn
from torch import flatten
from torch.utils.data import TensorDataset, DataLoader
from torch.optim import Adam

from google.colab import files,drive
drive.mount('/content/gdrive')

# get the GPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Mounted at /content/gdrive


In [2]:
data = pd.read_csv('/content/gdrive/MyDrive/P7_files/SalesCRM - CRM.csv')

In [3]:
data.shape

(11032, 13)

In [4]:
for col in data.columns:
  print(col)

ID
Country
Education
First Contact
Last Contact
Status
Stage
First Call
Signed up for a demo
Filled in customer survey
Did sign up to the platform
Account Manager assigned
Subscribed


In [5]:
rel_cols = data.drop(columns=['ID', 'Country', 'Education'])
nan_rows = data[rel_cols.isna().all(axis=1)]
nan_rows

Unnamed: 0,ID,Country,Education,First Contact,Last Contact,Status,Stage,First Call,Signed up for a demo,Filled in customer survey,Did sign up to the platform,Account Manager assigned,Subscribed
125,126,Canada,,,,,,,,,,,
189,191,,,,,,,,,,,,
190,192,,,,,,,,,,,,
275,278,Canada,,,,,,,,,,,


Let's delete the empty rows: they give no added value

In [6]:
# Many of the cells have value NaN. Let's replace it with a different value so that we are able to count the number of NaN's as well
parsed_data = data[~rel_cols.isna().all(axis=1)].copy()

# Fill the NaN
parsed_data['Country'] = parsed_data['Country'].fillna('missing')
parsed_data['Education'] = parsed_data['Education'].fillna('missing')
parsed_data['Status'] = parsed_data['Status'].fillna('missing')
parsed_data['Stage'] = parsed_data['Stage'].fillna('missing')


parsed_data.set_index('ID', inplace=True)

# Date features parsing

All the other features represent a date in time, so we need convert the columns to **datetime format**.

Let's define a parser function

In [7]:
# define the datetime parser function
def datetime_parser(column):

  # Iterate through all the rows, because different rows can have different datetime patterns
  for i in list(column.index):

    # Define the patterns observed in the data
    pattern_1 = re.compile(r'^[0-9]{4}\-[0-9]{2}\-[0-9]{4}$')
    pattern_2 = re.compile(r'^[0-9]{2}\.[0-9]{2}\.[0-9]{4}$')
    pattern_3 = re.compile(r'^[0-9]{2}\-[0-9]{2}\-[0-9]{4}$')

    # If the row is already in Datetime format, pass
    if isinstance(column.loc[i], pd.Timestamp):
      pass

    # If the value is missing, assign Not a Time
    elif pd.isna(column.loc[i]):
      column.loc[i] = pd.NaT

    # If the string matches the first patter, convert accordingly
    elif pattern_1.match(column.loc[i]):
      column.loc[i] = pd.to_datetime(column.loc[i], format='%Y-%m-%d')

    # If the string matches the second patter, convert accordingly
    elif pattern_2.match(column.loc[i]):
      column.loc[i] = pd.to_datetime(column.loc[i], format='%d.%m.%Y')

    # If the string matches the third patter, we are still not sure about the format
    elif pattern_3.match(column.loc[i]):
      # So, try one format and catch exceptions and use the other possible format
      try:
        column.loc[i] = pd.to_datetime(column.loc[i], format='%d-%m-%Y')
      except:
        column.loc[i] = pd.to_datetime(column.loc[i], format='%m-%d-%Y')

  return pd.to_datetime(column)

Manually fix an exception

In [8]:
# There is one row that has format Y-d-m rather then Y-m-d
idx = parsed_data[parsed_data['First Contact'] == '2021-13-12'].index
parsed_data['First Contact'].loc[idx] = "2021-12-13"
parsed_data['Last Contact'].loc[idx] = "2021-12-13"

Parse the **First Contact** column

In [9]:
parsed_data['First Contact'] = datetime_parser(parsed_data['First Contact'])

Parse the **Last Contact** column

In [10]:
parsed_data['Last Contact'] = datetime_parser(parsed_data['Last Contact'])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  column.loc[i] = pd.NaT
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  column.loc[i] = pd.to_datetime(column.loc[i], format='%d.%m.%Y')
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  column.loc[i] = pd.to_datetime(column.loc[i], format='%m-%d-%Y')


Parse the **First Call** column

In [11]:
parsed_data['First Call'] = datetime_parser(parsed_data['First Call'])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  column.loc[i] = pd.NaT


Parse the **Signed up for a demo** column

In [12]:
parsed_data['Signed up for a demo'] = datetime_parser(parsed_data['Signed up for a demo'])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  column.loc[i] = pd.NaT


Parse the **Filled in customer survey** column

In [13]:
parsed_data['Filled in customer survey'] = datetime_parser(parsed_data['Filled in customer survey'])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  column.loc[i] = pd.NaT


Parse the **Did sign up to the platform** column.

Actually we notice that some cells in this column do not contain a datetime value, but rather the string "No"

In [14]:
mask = (parsed_data['Did sign up to the platform'] == 'No')
did_not_signed = parsed_data[mask]
did_not_signed['First Call'].isna().sum()

0

Looking at the "First Call" column of these rows, we notice that in **all cases** the first call happened and the potential customer decided not to Sign up to the platform. Hopefully our model will be able to pick up this trend.

Now we need to remove the 'No' values from the column, but we stil want to preserve the information.

Let's create a **new boolean feature**

In [15]:
# first let's rename the original column
parsed_data.rename(columns={'Did sign up to the platform': 'Date Platform sign up'}, inplace=True)
# Create the new column
parsed_data['Bool Platform sign up'] = np.nan

# Create a mask to find the rows where Customer actually did signed up
signed_mask = (parsed_data['Date Platform sign up'].notna()) & (parsed_data['Date Platform sign up'] != 'No')
parsed_data['Bool Platform sign up'][signed_mask] = 1

# Create a mask for the rows where we know that customer DID NOT signed up
not_signed_mask = (parsed_data['Date Platform sign up'] == 'No')
parsed_data['Bool Platform sign up'][not_signed_mask] = -1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  parsed_data['Bool Platform sign up'][signed_mask] = 1
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  parsed_data['Bool Platform sign up'][not_signed_mask] = -1


Now we can actually proceed to parsing the column 'Date Platform sign up'

In [16]:
parsed_data['Date Platform sign up'] = parsed_data['Date Platform sign up'].replace('No', pd.NaT)

# When running the parser we notice that one cell has a typo in the year: 20221
# let's find the row and fix the issue
idx = np.where(parsed_data['Date Platform sign up'] == '20221-08-24')
parsed_data['Date Platform sign up'].iloc[idx] = '2021-08-24'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  parsed_data['Date Platform sign up'].iloc[idx] = '2021-08-24'


In [17]:
# Finally parse the column
parsed_data['Date Platform sign up'] = datetime_parser(parsed_data['Date Platform sign up'])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  column.loc[i] = pd.NaT


In [18]:
parsed_data['Account Manager assigned'] = datetime_parser(parsed_data['Account Manager assigned'])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  column.loc[i] = pd.NaT


In [19]:
# Also column "subscribed" has an unexpected value - we need to manually modify it
idx = np.where(parsed_data['Subscribed'] == '0000-00-00')
parsed_data['Subscribed'].iloc[idx] = pd.NaT

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  parsed_data['Subscribed'].iloc[idx] = pd.NaT


In [20]:
parsed_data['Subscribed'] = datetime_parser(parsed_data['Subscribed'])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  column.loc[i] = pd.NaT


In [21]:
parsed_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11028 entries, 1 to 11793
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Country                    11028 non-null  object        
 1   Education                  11028 non-null  object        
 2   First Contact              10317 non-null  datetime64[ns]
 3   Last Contact               10722 non-null  datetime64[ns]
 4   Status                     11028 non-null  object        
 5   Stage                      11028 non-null  object        
 6   First Call                 441 non-null    datetime64[ns]
 7   Signed up for a demo       283 non-null    datetime64[ns]
 8   Filled in customer survey  184 non-null    datetime64[ns]
 9   Date Platform sign up      296 non-null    datetime64[ns]
 10  Account Manager assigned   71 non-null     datetime64[ns]
 11  Subscribed                 47 non-null     datetime64[ns]
 12  Bool

# Exploratory Data Analysis - Country

In [22]:
# Bar plot for the counties
fig = px.bar(
    parsed_data,
    x=parsed_data.groupby('Country').size().values,
    y=parsed_data.groupby('Country').size().index
)

fig.update_layout(
    xaxis_title='Count',
    yaxis_title='Country',
    xaxis=dict(showline=False, showgrid=False),
    yaxis=dict(showline=False, showgrid=False)
)

fig.show()

I did not expect these many values for the country. Let's analyze this more in details:

In [23]:
parsed_data['Country'].value_counts()

USA                         6604
missing                      889
Canada                       815
France                       336
UK                           329
                            ... 
uSA                            1
Hong Kong                      1
Singapore                      1
Cameroon                       1
Central African Republic       1
Name: Country, Length: 104, dtype: int64

There are **104** different Country values.
Actually there are some typos (e.g. uSA), so let's elaborate a bit the strings to get a more informative overview

In [24]:
countries = parsed_data['Country'].str.rstrip().str.lower()
countries.value_counts()

usa                         6641
missing                      889
canada                       816
france                       336
uk                           331
                            ... 
full time                      1
bulgaria                       1
senegal                        1
jordan                         1
central african republic       1
Name: Country, Length: 94, dtype: int64

The modification showed indeed some typos. \

We also added a **rstrip** function to remove trailing spaces.

The new count is **94**

In [25]:
countries.value_counts().tail(30)

bolivia                     2
kenya                       2
malaysia                    2
sri lanka                   2
philadelphia                1
congo                       1
california                  1
czechia                     1
russia&ukraine              1
nottingham                  1
korea                       1
bulgaria & uk               1
vietnamese                  1
greek                       1
romania                     1
turkey                      1
latvia                      1
hong kong                   1
seoul                       1
venezuela                   1
chuang                      1
benin                       1
guinea                      1
cameroon                    1
serbia                      1
full time                   1
bulgaria                    1
senegal                     1
jordan                      1
central african republic    1
Name: Country, dtype: int64

Looking at the tail, we still notice some duplicates (non compehensive list):
- Czechia is mentioned twice
- Bulgaria
- Nottingham is a city, not a country. Same is true for Seul, philadelphia and california
- Vietnamese is not a country
- Full time is not a country

Hence, let's create a lookup table to pre-process the country column

In [26]:
lookup_table = {
    "nottingham": "uk",
    "seoul": "south korea",
    "greek": "greece",
    "vietnamese": "vietnam",
    "bulgaria & uk": "bulgaria",
    "korea": "south korea",
    "russia&ukraine": "russia",
    "california": "usa",
    "philadelphia": "usa",
    "ca": "usa",
    'england': "uk",
    "dubai": "united arab emirates",
    "czechia (czech republic)": "czechia",
    "chuang": "china",
    "-": "missing"
}

In [27]:
countries = countries.apply(lambda x: lookup_table[x] if x in lookup_table.keys() else x)
countries.value_counts()

usa                         6645
missing                      899
canada                       816
france                       336
uk                           334
                            ... 
benin                          1
cameroon                       1
full time                      1
turkey                         1
central african republic       1
Name: Country, Length: 79, dtype: int64

Now we have **79** unique values

In [28]:
parsed_data['Country'] = countries.copy()

# Bar plot for the counties
fig = px.bar(
    parsed_data,
    x=countries.value_counts(),
    y=countries.value_counts().index
)

fig.update_layout(
    xaxis_title='Count',
    yaxis_title='Country',
    xaxis=dict(showline=False, showgrid=False),
    yaxis=dict(showline=False, showgrid=False)
)

fig.show()

The feature 'Country' will be part of the **state**:

we will **one-hot encode** the countries to obtain the first part of the state vector (size 79).

Note that this part of the state **cannot be changed** by any of the actions: it is a bio information

In [29]:
idx = np.where(parsed_data['Country'] == 'missing')
print(f'Percentage of Missing values in Column Country: {idx[0].shape[0]/parsed_data.shape[0] * 100 :.2f} %')

Percentage of Missing values in Column Country: 8.15 %


All the rows of this column must have a value - the value 'Missing' does not make sense.

Since we have no additional information, we decide to fill out the missing values with the **10 most observed categories** in the observed proportion.

In [30]:
# Select the top 10 observed categories
top_categories = parsed_data['Country'].value_counts().nlargest(10).index

# Filter the data to include only the top 10 categories
parsed_data_top_categories = parsed_data[parsed_data['Country'].isin(top_categories)]
# Drop the Missing
parsed_data_top_categories = parsed_data_top_categories[parsed_data_top_categories['Country'] != 'missing']

missing_values_count = idx[0].shape[0]

# Replace 'Missing' values using observed categories in proportion (top 10)
replacement_values = parsed_data_top_categories['Country'].sample(
    n=missing_values_count,
    replace=True,
    weights=parsed_data_top_categories.groupby('Country')['Country'].transform('count')
).values

# Replace the missing values
parsed_data['Country'].iloc[idx[0]] = replacement_values

# Bar plot for the counties
fig = px.bar(
    parsed_data,
    x=parsed_data['Country'].value_counts(),
    y=parsed_data['Country'].value_counts().index
)

fig.update_layout(
    xaxis_title='Count',
    yaxis_title='Country',
    xaxis=dict(showline=False, showgrid=False),
    yaxis=dict(showline=False, showgrid=False)
)

fig.show()



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



# Exploratory Data Analysis - Education

In [31]:
# Bar plot for the counties
fig = px.bar(
    parsed_data,
    x=parsed_data.groupby('Education').size().values,
    y=parsed_data.groupby('Education').size().index
)

fig.update_layout(
    xaxis_title='Count',
    yaxis_title='Education',
    xaxis=dict(showline=False, showgrid=False),
    yaxis=dict(showline=False, showgrid=False)
)

fig.show()

the education category has a limited number of options - good.

Still important to notice that a lot of samples have value **missing**.

In [32]:
parsed_data['Education'].value_counts()

missing    3599
B27        1942
B11         923
B10         670
B9          635
B1          393
B29         353
B21         331
B17         227
B14         218
B30         218
B16         213
B19         168
B28         132
B8          129
B12          93
B15          87
B13          77
B25          76
B24          70
B22          69
B26          62
B23          57
B18          57
B3           42
B20          39
B2           35
B6           34
B4           33
B5           27
B7           19
Name: Education, dtype: int64

In [33]:
len(parsed_data['Education'].value_counts())

31

The feature 'Education' will be also part of the **state**:

we will **one-hot encode** the education levels to obtain the second part of the state vector (size 31).

Note that this part of the state **cannot be changed** by any of the actions: it is a bio information

In [34]:
idx = np.where(parsed_data['Education'] == 'missing')
print(f'Percentage of Missing values in Column Education: {idx[0].shape[0]/parsed_data.shape[0] * 100 :.2f} %')

Percentage of Missing values in Column Education: 32.64 %


All the rows of this column must have a value - the value 'Missing' does not make sense.

Since we have no additional information, we decide to fill out the missing values with all the possible categories the in the observed proportion.

In [35]:
# Select the top 10 observed categories
top_categories = parsed_data['Education'].value_counts().index

# Drop the Missing
parsed_data_no_missing = parsed_data[parsed_data['Education'] != 'missing']

missing_values_count = idx[0].shape[0]

# Replace 'Missing' values using observed categories in proportion (top 10)
replacement_values = parsed_data_no_missing['Education'].sample(
    n=missing_values_count,
    replace=True,
    weights=parsed_data_no_missing.groupby('Education')['Education'].transform('count')
).values

# Replace the missing values
parsed_data['Education'].iloc[idx[0]] = replacement_values

# Bar plot for the counties
fig = px.bar(
    parsed_data,
    x=parsed_data['Education'].value_counts(),
    y=parsed_data['Education'].value_counts().index
)

fig.update_layout(
    xaxis_title='Count',
    yaxis_title='Education',
    xaxis=dict(showline=False, showgrid=False),
    yaxis=dict(showline=False, showgrid=False)
)

fig.show()



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



# Exploratory Data Analysis - Status

The **Status** column, formerly "Message State", indicates how many times a potential customer was contacted. If the customer at some point is interested, they will proceed with the demo call

In [36]:
fig = px.bar(
    parsed_data,
    x=parsed_data.groupby('Status').size().values,
    y=parsed_data.groupby('Status').size().index
)

fig.update_layout(
    xaxis_title='Count',
    yaxis_title='Status',
    xaxis=dict(showline=False, showgrid=False),
    yaxis=dict(showline=False, showgrid=False)
)

fig.show()

Also the status will be part of the **state space**:

We will encode it as the Integer number of messages sent to the potential client.

Note that this part of the state **can be changed** via an action

In [37]:
idx = np.where(parsed_data['Status'] == 'missing')
parsed_data.iloc[idx[0]]

Unnamed: 0_level_0,Country,Education,First Contact,Last Contact,Status,Stage,First Call,Signed up for a demo,Filled in customer survey,Date Platform sign up,Account Manager assigned,Subscribed,Bool Platform sign up
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1,usa,B27,NaT,NaT,missing,missing,NaT,NaT,NaT,2022-04-27,NaT,NaT,1.0
2,austria,B27,NaT,NaT,missing,missing,2022-04-28,2022-04-25,2022-04-25,2022-04-25,NaT,NaT,1.0
3,united arab emirates,B27,NaT,NaT,missing,missing,NaT,2022-04-24,NaT,NaT,NaT,NaT,
4,france,B27,NaT,NaT,missing,missing,2022-04-22,2022-04-20,2022-04-20,2022-04-22,2022-04-22,NaT,1.0
5,usa,B27,NaT,NaT,missing,missing,2022-04-23,2022-04-19,2022-04-19,NaT,NaT,NaT,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9421,usa,B27,NaT,2021-09-01,missing,missing,NaT,NaT,NaT,NaT,NaT,NaT,
9422,usa,B27,NaT,2021-09-01,missing,missing,NaT,NaT,NaT,NaT,NaT,NaT,
9423,mexico,B1,NaT,2021-09-01,missing,missing,2021-09-30,2021-12-22,2021-12-22,2021-12-22,NaT,NaT,1.0
9830,usa,B11,2021-11-10,2021-11-10,missing,missing,NaT,NaT,NaT,NaT,NaT,NaT,


Looking at the rows with Missing values, we can see that some of them show later steps in the onboarding process.

Hence, we are going to deal with the Status missing values as follow:
- If the 'First Contact' or 'Last Contact' columns have a value, we are going to fill in the missing Status value
- If the 'Stage' has a value, we are going to fill 1st message
- If any of the onboarding steps have been done, then  we will change the status to 1st message.

In [38]:
mask_contact = parsed_data['First Contact'].notna() | parsed_data['Last Contact'].notna() | (parsed_data['Stage'] != 'missing')
mask_onboarding = (parsed_data['First Call'].notna() | parsed_data['Signed up for a demo'].notna() | parsed_data['Filled in customer survey'].notna()
                  | parsed_data['Date Platform sign up'].notna() | parsed_data['Account Manager assigned'].notna() | parsed_data['Subscribed'].notna())
mask_missing = parsed_data['Status'] == 'missing'
mask = (mask_contact | mask_onboarding) & mask_missing
parsed_data['Status'][mask]

ID
1        missing
2        missing
3        missing
4        missing
5        missing
          ...   
9421     missing
9422     missing
9423     missing
9830     missing
11154    missing
Name: Status, Length: 799, dtype: object

In [39]:
parsed_data['Status'][mask] = '1st message'

fig = px.bar(
    parsed_data,
    x=parsed_data.groupby('Status').size().values,
    y=parsed_data.groupby('Status').size().index
)

fig.update_layout(
    xaxis_title='Count',
    yaxis_title='Status',
    xaxis=dict(showline=False, showgrid=False),
    yaxis=dict(showline=False, showgrid=False)
)

fig.show()



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [40]:
parsed_data[parsed_data['Status'] == 'missing']

Unnamed: 0_level_0,Country,Education,First Contact,Last Contact,Status,Stage,First Call,Signed up for a demo,Filled in customer survey,Date Platform sign up,Account Manager assigned,Subscribed,Bool Platform sign up
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1


The 4 remaining missing values will be considered **outliers and dropped**

In [41]:
parsed_data = parsed_data[parsed_data['Status'] != 'missing']

# Exploratory Data Analysis - Stage

The column stage reflects the potential customer reaction before and/or after the demo call

In [42]:
fig = px.bar(
    parsed_data,
    x=parsed_data.groupby('Stage').size().values,
    y=parsed_data.groupby('Stage').size().index
)

fig.update_layout(
    xaxis_title='Count',
    yaxis_title='Stage',
    xaxis=dict(showline=False, showgrid=False),
    yaxis=dict(showline=False, showgrid=False)
)

fig.show()

The feature 'Stage' will be also part of the **state**:

we will **one-hot encode** the education levels to obtain the second part of the state vector (size 7).

Note that this part of the state **can be changed**: indeed it starts empty and based on previous actions it can be filled out

For simplicity sake, we can **group** the categories in order to have just four:
- missing (i.e. unknown)
- not interested
- interested
- subscribed already

In [43]:
mask = ((parsed_data['Stage'] == 'not interested') | (parsed_data['Stage'] == 'do not contact') | (parsed_data['Stage'] == 'did not join the call') | (parsed_data['Stage'] == 'declined/canceled call') )
parsed_data['Stage'][mask] = 'not interested'

fig = px.bar(
    parsed_data,
    x=parsed_data.groupby('Stage').size().values,
    y=parsed_data.groupby('Stage').size().index
)

fig.update_layout(
    xaxis_title='Count',
    yaxis_title='Stage',
    xaxis=dict(showline=False, showgrid=False),
    yaxis=dict(showline=False, showgrid=False)
)

fig.show()



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



check if amongst the not interested, some have done the onboarding steps

In [44]:
idx = np.where(parsed_data['Stage'] == 'not interested')
tmp = parsed_data.iloc[idx]

tmp['First Call'].notna().sum()

126

We notice that some of the 'not interested' potential customers actually had a first call.

This indicates that the column Stage is updated also **during the onboarding process**

This means that it would be **very hard** to fill out the missing values with the available information

There is one thing we can do though:

since we understood that the feature Stage is updated also during the onboarding process, we can use its values 'subscribed already' to fill out **missing subscription dates**

In [45]:
mask = (parsed_data['Stage'] == 'subscribed already') & parsed_data['Subscribed'].isna()
parsed_data['Subscribed'][mask] = pd.to_datetime('2023-11-20')
parsed_data[mask]



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Unnamed: 0_level_0,Country,Education,First Contact,Last Contact,Status,Stage,First Call,Signed up for a demo,Filled in customer survey,Date Platform sign up,Account Manager assigned,Subscribed,Bool Platform sign up
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
170,usa,B27,NaT,NaT,1st message,subscribed already,2021-11-26,2021-11-24,2021-11-24,2021-11-24,2021-12-01,2023-11-20,1.0
213,canada,B27,NaT,NaT,1st message,subscribed already,2021-11-03,2021-10-26,2021-10-26,2021-11-06,2021-11-06,2023-11-20,1.0
7889,usa,B27,2021-05-15,2021-05-15,1st message,subscribed already,2021-07-08,NaT,NaT,2021-07-08,2021-08-02,2023-11-20,1.0


All the above analysis highlighted the complex relationship between the Stage column and the other columns.

Hence, for simplicity sake we will **drop this column**

In [46]:
data_to_use = parsed_data.copy()
data_to_use.drop(columns=['Stage'], inplace=True)

# Date features analysis

We can leverage the date values of the features to find the order of the steps in each case.

First thing, we want to check if the "Last contact" date records also contacts **during** the onboarding steps

In [47]:
onboarding_only = parsed_data.drop(columns=['Country', 'Education', 'First Contact', 'Last Contact', 'Status', 'Stage', 'Bool Platform sign up'])
parsed_data['First step'] = onboarding_only.min(axis=1)
parsed_data['Last step'] = onboarding_only.max(axis=1)

In [48]:
# Check if the first step of the onboarding process happened BEFORE the last contact date recorded
mask = (parsed_data['First step'] < parsed_data['Last Contact'])

parsed_data[mask]

Unnamed: 0_level_0,Country,Education,First Contact,Last Contact,Status,Stage,First Call,Signed up for a demo,Filled in customer survey,Date Platform sign up,Account Manager assigned,Subscribed,Bool Platform sign up,First step,Last step
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
75,usa,B27,NaT,2022-03-01,1st message,missing,NaT,NaT,NaT,2022-02-19,NaT,NaT,1.0,2022-02-19,2022-02-19
304,usa,B27,2021-08-04,2021-08-04,3rd message,interested,2021-07-27,NaT,NaT,NaT,NaT,NaT,-1.0,2021-07-27,2021-07-27
305,canada,B27,2021-07-21,2022-03-18,1st message,missing,2021-07-27,NaT,NaT,2021-07-27,NaT,NaT,1.0,2021-07-27,2021-07-27
356,usa,B11,2021-06-25,2021-08-25,3rd message,not interested,NaT,NaT,NaT,2021-08-24,NaT,NaT,1.0,2021-08-24,2021-08-24
374,usa,B27,2020-10-04,2021-06-23,1st message,subscribed already,2021-06-22,NaT,NaT,2021-06-22,2021-07-01,2021-06-28,1.0,2021-06-22,2021-07-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7596,india,B1,2020-12-23,2022-02-21,1st message,missing,2020-12-24,NaT,NaT,NaT,NaT,NaT,,2020-12-24,2020-12-24
7653,iran,B1,2020-12-23,2021-11-29,1st message,missing,2020-12-29,NaT,NaT,2020-12-29,NaT,NaT,1.0,2020-12-29,2020-12-29
7812,usa,B27,2020-05-13,2021-10-12,1st message,interested,2020-10-15,NaT,NaT,2020-10-15,NaT,NaT,1.0,2020-10-15,2020-10-15
9068,usa,B27,2021-09-01,2021-09-07,2nd message,interested,2021-08-31,NaT,NaT,NaT,NaT,NaT,-1.0,2021-08-31,2021-08-31


apparently in some cases the Last Contact column is updated even after the onboarding process is started:

This observation show the high complexity of the data we have available

# Date features analysis - First and Last contact

let's check the order of the dates.

First we focus on the **First Contact** and **Last Contact** columns.

We expect the Last contact to be **greater or equal** to the First Contact, but as we can see there are (very few) outliers

In [49]:
print('Number of rows in which the First Contact happened BEFORE the Last contact: ', (parsed_data['First Contact'] <= parsed_data['Last Contact']).sum())
print('Number of rows in which the First Contact happened AFTER the Last contact: ', (parsed_data['First Contact'] > parsed_data['Last Contact']).sum())

Number of rows in which the First Contact happened BEFORE the Last contact:  10292
Number of rows in which the First Contact happened AFTER the Last contact:  3


For **simplicity sake** we will drop this two columns in the model

In [50]:
data_to_use.drop(columns=['First Contact', 'Last Contact'], inplace=True)

# Date features analysis - First Call

In [51]:
col_1 = 'First Call'
col_2 = 'Signed up for a demo'

print(f'Number of rows in which the {col_1} happened BEFORE the {col_2}: ', (parsed_data[col_1] < parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AT THE SAME TIME as the {col_2}: ', (parsed_data[col_1] == parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AFTER the {col_2}: ', (parsed_data[col_1] > parsed_data[col_2]).sum())

Number of rows in which the First Call happened BEFORE the Signed up for a demo:  10
Number of rows in which the First Call happened AT THE SAME TIME as the Signed up for a demo:  9
Number of rows in which the First Call happened AFTER the Signed up for a demo:  96


In [52]:
col_1 = 'First Call'
col_2 = 'Filled in customer survey'

print(f'Number of rows in which the {col_1} happened BEFORE the {col_2}: ', (parsed_data[col_1] < parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AT THE SAME TIME as the {col_2}: ', (parsed_data[col_1] == parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AFTER the {col_2}: ', (parsed_data[col_1] > parsed_data[col_2]).sum())

Number of rows in which the First Call happened BEFORE the Filled in customer survey:  10
Number of rows in which the First Call happened AT THE SAME TIME as the Filled in customer survey:  10
Number of rows in which the First Call happened AFTER the Filled in customer survey:  93


In [53]:
col_1 = 'First Call'
col_2 = 'Date Platform sign up'

print(f'Number of rows in which the {col_1} happened BEFORE the {col_2}: ', (parsed_data[col_1] < parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AT THE SAME TIME as the {col_2}: ', (parsed_data[col_1] == parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AFTER the {col_2}: ', (parsed_data[col_1] > parsed_data[col_2]).sum())

Number of rows in which the First Call happened BEFORE the Date Platform sign up:  29
Number of rows in which the First Call happened AT THE SAME TIME as the Date Platform sign up:  151
Number of rows in which the First Call happened AFTER the Date Platform sign up:  57


In [54]:
col_1 = 'First Call'
col_2 = 'Account Manager assigned'

print(f'Number of rows in which the {col_1} happened BEFORE the {col_2}: ', (parsed_data[col_1] < parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AT THE SAME TIME as the {col_2}: ', (parsed_data[col_1] == parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AFTER the {col_2}: ', (parsed_data[col_1] > parsed_data[col_2]).sum())

Number of rows in which the First Call happened BEFORE the Account Manager assigned:  60
Number of rows in which the First Call happened AT THE SAME TIME as the Account Manager assigned:  9
Number of rows in which the First Call happened AFTER the Account Manager assigned:  0


In [55]:
col_1 = 'First Call'
col_2 = 'Subscribed'

print(f'Number of rows in which the {col_1} happened BEFORE the {col_2}: ', (parsed_data[col_1] < parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AT THE SAME TIME as the {col_2}: ', (parsed_data[col_1] == parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AFTER the {col_2}: ', (parsed_data[col_1] > parsed_data[col_2]).sum())

Number of rows in which the First Call happened BEFORE the Subscribed:  49
Number of rows in which the First Call happened AT THE SAME TIME as the Subscribed:  0
Number of rows in which the First Call happened AFTER the Subscribed:  1


It is hard to draw a conclusion on the order of the first steps.

However, we can confidently say that the First Call surely happens **before**:
- Account Manager assigned
- Subscribed (in this case there is one outlier, we will need to handle it)

# Date features analysis - Signed up for a demo

In [56]:
col_1 = 'Signed up for a demo'
col_2 = 'Filled in customer survey'

print(f'Number of rows in which the {col_1} happened BEFORE the {col_2}: ', (parsed_data[col_1] < parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AT THE SAME TIME as the {col_2}: ', (parsed_data[col_1] == parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AFTER the {col_2}: ', (parsed_data[col_1] > parsed_data[col_2]).sum())

Number of rows in which the Signed up for a demo happened BEFORE the Filled in customer survey:  71
Number of rows in which the Signed up for a demo happened AT THE SAME TIME as the Filled in customer survey:  113
Number of rows in which the Signed up for a demo happened AFTER the Filled in customer survey:  0


In [57]:
col_1 = 'Signed up for a demo'
col_2 = 'Date Platform sign up'

print(f'Number of rows in which the {col_1} happened BEFORE the {col_2}: ', (parsed_data[col_1] < parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AT THE SAME TIME as the {col_2}: ', (parsed_data[col_1] == parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AFTER the {col_2}: ', (parsed_data[col_1] > parsed_data[col_2]).sum())

Number of rows in which the Signed up for a demo happened BEFORE the Date Platform sign up:  48
Number of rows in which the Signed up for a demo happened AT THE SAME TIME as the Date Platform sign up:  37
Number of rows in which the Signed up for a demo happened AFTER the Date Platform sign up:  3


In [58]:
col_1 = 'Signed up for a demo'
col_2 = 'Account Manager assigned'

print(f'Number of rows in which the {col_1} happened BEFORE the {col_2}: ', (parsed_data[col_1] < parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AT THE SAME TIME as the {col_2}: ', (parsed_data[col_1] == parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AFTER the {col_2}: ', (parsed_data[col_1] > parsed_data[col_2]).sum())

Number of rows in which the Signed up for a demo happened BEFORE the Account Manager assigned:  47
Number of rows in which the Signed up for a demo happened AT THE SAME TIME as the Account Manager assigned:  0
Number of rows in which the Signed up for a demo happened AFTER the Account Manager assigned:  0


In [59]:
col_1 = 'Signed up for a demo'
col_2 = 'Subscribed'

print(f'Number of rows in which the {col_1} happened BEFORE the {col_2}: ', (parsed_data[col_1] < parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AT THE SAME TIME as the {col_2}: ', (parsed_data[col_1] == parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AFTER the {col_2}: ', (parsed_data[col_1] > parsed_data[col_2]).sum())

Number of rows in which the Signed up for a demo happened BEFORE the Subscribed:  34
Number of rows in which the Signed up for a demo happened AT THE SAME TIME as the Subscribed:  0
Number of rows in which the Signed up for a demo happened AFTER the Subscribed:  0


Looking at the above results, it seems that the **Signed up for a demo** step needs to happen before (or on the same day as) the next steps:
- Filled in customer survey
- Did sign up to the platform (in this case there are three outliers)
- Account Manager assigned
- Subscribed

# Date features analysis - Filled in customer survey

In [60]:
col_1 = 'Filled in customer survey'
col_2 = 'Date Platform sign up'

print(f'Number of rows in which the {col_1} happened BEFORE the {col_2}: ', (parsed_data[col_1] < parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AT THE SAME TIME as the {col_2}: ', (parsed_data[col_1] == parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AFTER the {col_2}: ', (parsed_data[col_1] > parsed_data[col_2]).sum())

Number of rows in which the Filled in customer survey happened BEFORE the Date Platform sign up:  37
Number of rows in which the Filled in customer survey happened AT THE SAME TIME as the Date Platform sign up:  48
Number of rows in which the Filled in customer survey happened AFTER the Date Platform sign up:  2


In [61]:
col_1 = 'Filled in customer survey'
col_2 = 'Account Manager assigned'

print(f'Number of rows in which the {col_1} happened BEFORE the {col_2}: ', (parsed_data[col_1] < parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AT THE SAME TIME as the {col_2}: ', (parsed_data[col_1] == parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AFTER the {col_2}: ', (parsed_data[col_1] > parsed_data[col_2]).sum())

Number of rows in which the Filled in customer survey happened BEFORE the Account Manager assigned:  47
Number of rows in which the Filled in customer survey happened AT THE SAME TIME as the Account Manager assigned:  0
Number of rows in which the Filled in customer survey happened AFTER the Account Manager assigned:  0


In [62]:
col_1 = 'Filled in customer survey'
col_2 = 'Subscribed'

print(f'Number of rows in which the {col_1} happened BEFORE the {col_2}: ', (parsed_data[col_1] < parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AT THE SAME TIME as the {col_2}: ', (parsed_data[col_1] == parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AFTER the {col_2}: ', (parsed_data[col_1] > parsed_data[col_2]).sum())

Number of rows in which the Filled in customer survey happened BEFORE the Subscribed:  34
Number of rows in which the Filled in customer survey happened AT THE SAME TIME as the Subscribed:  0
Number of rows in which the Filled in customer survey happened AFTER the Subscribed:  0


Once again, it seems that the customer survey is filled before the next steps - except for the Platform signup that has 2 outliers

# Date features analysis - Did sign up to the platform

In [63]:
col_1 = 'Date Platform sign up'
col_2 = 'Account Manager assigned'

print(f'Number of rows in which the {col_1} happened BEFORE the {col_2}: ', (parsed_data[col_1] < parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AT THE SAME TIME as the {col_2}: ', (parsed_data[col_1] == parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AFTER the {col_2}: ', (parsed_data[col_1] > parsed_data[col_2]).sum())

Number of rows in which the Date Platform sign up happened BEFORE the Account Manager assigned:  62
Number of rows in which the Date Platform sign up happened AT THE SAME TIME as the Account Manager assigned:  7
Number of rows in which the Date Platform sign up happened AFTER the Account Manager assigned:  0


In [64]:
col_1 = 'Date Platform sign up'
col_2 = 'Subscribed'

print(f'Number of rows in which the {col_1} happened BEFORE the {col_2}: ', (parsed_data[col_1] < parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AT THE SAME TIME as the {col_2}: ', (parsed_data[col_1] == parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AFTER the {col_2}: ', (parsed_data[col_1] > parsed_data[col_2]).sum())

Number of rows in which the Date Platform sign up happened BEFORE the Subscribed:  46
Number of rows in which the Date Platform sign up happened AT THE SAME TIME as the Subscribed:  2
Number of rows in which the Date Platform sign up happened AFTER the Subscribed:  0


# Date features analysis - Account Manager assigned

In [65]:
col_1 = 'Account Manager assigned'
col_2 = 'Subscribed'

print(f'Number of rows in which the {col_1} happened BEFORE the {col_2}: ', (parsed_data[col_1] < parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AT THE SAME TIME as the {col_2}: ', (parsed_data[col_1] == parsed_data[col_2]).sum())
print(f'Number of rows in which the {col_1} happened AFTER the {col_2}: ', (parsed_data[col_1] > parsed_data[col_2]).sum())

Number of rows in which the Account Manager assigned happened BEFORE the Subscribed:  33
Number of rows in which the Account Manager assigned happened AT THE SAME TIME as the Subscribed:  3
Number of rows in which the Account Manager assigned happened AFTER the Subscribed:  11


# Conclusions

It seems that the order of the steps is not clearly defined.

However, in general, we can say that the order is:
- 'Signed up for a demo'
- 'Filled in customer survey'
- 'First Call'
- 'Date Platform sign up'
- 'Account Manager assigned'
- 'Subscribed'

The onboarding steps will also be part of the **state**. We will encode them as follow:

a (6,) binary vector that indicates whether a step was done or not;

- The starting point is [0,0,0,0,0,0]
- [1,0,0,0,0,0] and [0,1,0,0,0,0] are both valid first steps
- [1,1,0,0,0,0] is needed to move to [1,1,1,0,0,0]
- [1,1,1,0,0,0] is needed to move to [1,1,1,1,0,0]
- From [1,1,1,1,0,0] it is possible to go in both [1,1,1,1,1,0] and [1,1,1,1,0,1]
- [1,1,1,1,1,1] is the desired state

To summarize, the state will be made of the **concatenation** of the following arrays:

- One-hot encoding of the Country - **cannot** be changed - len 78
- One-hot encoding of the Education - **cannot** be changed - len 30
- Int for the Status - it can change - len 1
- Binary encoding of the Onboarding steps - it can change - len 6

Total length of the state is: 78 + 30 + 1  + 6 = 115

# State space creation

Then to convert the status from string to int

In [66]:
data_to_use['Int status'] = 0
data_to_use['Int status'][data_to_use['Status'] == '1st message'] = 1
data_to_use['Int status'][data_to_use['Status'] == '2nd message'] = 2
data_to_use['Int status'][data_to_use['Status'] == '3rd message'] = 3



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



We would actually like to create a **new boolean feature** that indicates whether the onboarding process has started or not

In [67]:
mask = parsed_data['First step'].notna()
data_to_use['OB started'] = 0
data_to_use['OB started'][mask] = 1



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



Regarding the onboarding, we want to know:
- Which steps where made
- In which order they were made

Hence, we are going to create a dataframe with 6 columns, the number of onboarding steps.

In each row, we are going to fill column by column the numerical encoding of the onboarding step that was performed.

If none were performed, because the client exited the onbaording or because they subscribed, the cell will be filled with -1

In this way, moving along the columns of the new dataframe, we can know which onbaording step was done next.

Note that indexes in python start with 0, but we want to start with 1.

So we are going to create the onboarding DataFrame with the indexes, then add +1 to all cells.

Hence, we originally need to set the missing steps value to -2, which will then become -1

In [68]:
# This is a function that returns the index of the column with the earliest date
def earliest_date_index(x):

    non_nat_indices = np.where(~pd.isna(x))[0]

    if non_nat_indices.size > 0:
        earliest_index = min(non_nat_indices, key=lambda i: x[i])
        return earliest_index
    else:
        return -2

# Define a function to set the value in the specified column to NaN
def set_value_to_nan(row, step):

  # get the column index of the first step
  step_index = int(step[row.name])

  # If index is -2 it means that the whole row is already nan
  if step_index != -2:
    column_to_set_nan = ob_new_order.columns[step_index]
    row[column_to_set_nan] = np.nan
  return row

For future use, we would like to columns to be in the proper order

In [69]:
new_column_order = ['Signed up for a demo',
                    'Filled in customer survey',
                    'First Call',
                    'Date Platform sign up',
                    'Account Manager assigned',
                    'Subscribed']

In [70]:
# Create a copy that will be modified:
# each iteration, we will "delete" the date of the previously detected step
ob_new_order = onboarding_only[new_column_order].copy()

In [71]:
ob_copy = ob_new_order.copy()

# get the first step for each row, if any
first_step = ob_copy.apply(earliest_date_index, axis=1)
old_step = first_step.copy()
# Create the OB steps dataframe, starting with the first step
ob_steps = first_step.copy()

# Fill in the dataframe
for i in range(5):
  ob_copy = ob_copy.apply(set_value_to_nan, axis=1, args=(old_step,))
  new_step = ob_copy.apply(earliest_date_index, axis=1)
  ob_steps = pd.concat([ob_steps, new_step], axis=1, ignore_index=True)
  old_step = new_step

In [72]:
mask = (ob_steps.iloc[:, -1] != -2) & ~(ob_steps.iloc[:, -1] == 5)

In [73]:
parsed_data[mask]

Unnamed: 0_level_0,Country,Education,First Contact,Last Contact,Status,Stage,First Call,Signed up for a demo,Filled in customer survey,Date Platform sign up,Account Manager assigned,Subscribed,Bool Platform sign up,First step,Last step
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
141,mexico,B1,NaT,NaT,1st message,subscribed already,2021-12-22,2021-12-20,2021-12-20,2021-12-21,2022-01-03,2021-12-30,1.0,2021-12-20,2022-01-03
220,canada,B27,NaT,NaT,1st message,subscribed already,2021-10-26,2021-10-25,2021-10-26,2021-10-28,2021-12-28,2021-10-28,1.0,2021-10-25,2021-12-28
7852,usa,B27,2021-05-14,2021-09-15,1st message,subscribed already,2021-09-21,2021-09-15,2021-09-16,2021-09-25,2021-10-01,2021-09-30,1.0,2021-09-15,2021-10-01
8070,usa,B27,2021-06-04,2021-09-22,1st message,subscribed already,2021-09-30,2021-09-23,2021-09-26,2021-09-27,2021-10-01,2021-09-27,1.0,2021-09-23,2021-10-01
8405,usa,B27,2021-06-20,2021-08-30,1st message,subscribed already,2021-08-31,2021-08-31,2021-09-01,2021-09-01,2021-09-06,2021-09-05,1.0,2021-08-31,2021-09-06


There are some rare cases in which the customer is successfully acquired, and the Account Manager is assigned **after** the subscription.

Although some of these cases are exceptions (e.g. 30 Dec - 3 Jan), it's still advisable to promptly assign an account manager to the customer.

Since these are limited cases, and the final goal is the subscription, we will "delete" the fact that the account manager was assigned after the subscription buy deleting the record of it.

In [74]:
ob_steps.loc[mask, ob_steps.columns[5]] = -2

Let's now add +1 to the created dataframe, and add it to the data_to_use dataframe

In [75]:
ob_steps = ob_steps + 1
data_to_use = pd.concat([data_to_use, ob_steps], axis=1)

As last steps we **one hot encode** the country and education features.

We do this in the last step to obtain the columns in the desired order.
Indeed, Country and Education are state space components that cannot be changed by any action, so we just leave them at the end of the array

In [76]:
data_to_use = pd.get_dummies(data_to_use, columns=['Country'], prefix='Country')
data_to_use = pd.get_dummies(data_to_use, columns=['Education'], prefix='Education')

Drop the old columns

In [77]:
data_to_use.drop(columns=['First Call', 'Signed up for a demo', 'Filled in customer survey', 'Date Platform sign up', 'Bool Platform sign up', 'Status', 'Account Manager assigned', 'Subscribed'], inplace=True)

In [78]:
data_to_use

Unnamed: 0_level_0,Int status,OB started,0,1,2,3,4,5,Country_algeria,Country_argentina,...,Education_B28,Education_B29,Education_B3,Education_B30,Education_B4,Education_B5,Education_B6,Education_B7,Education_B8,Education_B9
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1,1,4,-1,-1,-1,-1,-1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,2,4,3,-1,-1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,-1,-1,-1,-1,-1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,2,3,4,5,-1,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1,1,1,2,3,-1,-1,-1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11789,1,0,-1,-1,-1,-1,-1,-1,0,0,...,0,0,0,0,0,0,0,0,0,0
11790,1,0,-1,-1,-1,-1,-1,-1,0,0,...,0,0,0,0,0,0,0,0,0,0
11791,1,0,-1,-1,-1,-1,-1,-1,0,0,...,0,0,0,0,0,0,0,0,0,1
11792,1,0,-1,-1,-1,-1,-1,-1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [79]:
with open('/content/gdrive/MyDrive/P7_files/data_to_use.pkl', 'wb') as handle:
  pickle.dump(data_to_use, handle)

# Action space

The possible actions that can be done are:
- Contact the potential customer
- Do the next onboarding step
- Do nothing

Each action can lead to **multiple** new states. We need to identify all the possible transitions and their **probability**.

Of course the action "do nothing" leave the state unchanged.

Since we don't have additional information, we will estimate the transition probability based on the available data.

Let's now analyze action by action and model all the probabilities.

Note that, as already stated, the country and education will always be **unchanged**, hence the actions will only act on the first 7 columns of the data to use



In [80]:
actions = [['contact', 'nothing'],
           ['next', 'nothing']]

In [81]:
trans_probs = {}

# Action space - Contact potential customer

Since we dropped few columns, the effect of the action "Contact potential customer" is:

- Increase "Int status" by 1 with probability 1
- It **might** change the onboarding process step

Now we need to compute the probabilities that the first step happens for each initial state

In [92]:
mask = (parsed_data['Status'] == '1st message')
x = ob_steps.loc[mask, 0].value_counts() / ob_steps.loc[mask, 0].shape[0]
x

-1    0.954834
 1    0.028498
 3    0.010754
 4    0.005915
Name: 0, dtype: float64

In [93]:
trans_probs['1st mesg'] = {'-1': x[-1],
                '1': x[1],
                '2': 0,
                '3': x[3],
                '4': x[4],
                '5': 0,
                '6': 0}

In [94]:
mask = (parsed_data['Status'] == '2nd message')
x = ob_steps.loc[mask, 0].value_counts() / ob_steps.loc[mask, 0].shape[0]
x

-1    0.801762
 3    0.189427
 1    0.004405
 4    0.004405
Name: 0, dtype: float64

In [95]:
trans_probs['1st mesg'] = {'-1': x[-1],
                '1': x[1],
                '2': 0,
                '3': x[3],
                '4': x[4],
                '5': 0,
                '6': 0}

In [96]:
mask = (parsed_data['Status'] == '3rd message')
x = ob_steps.loc[mask, 0].value_counts() / ob_steps.loc[mask, 0].shape[0]
x

-1    0.933009
 3    0.058465
 4    0.007308
 1    0.001218
Name: 0, dtype: float64

In [97]:
trans_probs['1st mesg'] = {'-1': x[-1],
                '1': x[1],
                '2': 0,
                '3': x[3],
                '4': x[4],
                '5': 0,
                '6': 0}

In [98]:
trans_probs

{'1st mesg': {'-1': 0.9330085261875761,
  '1': 0.001218026796589525,
  '2': 0,
  '3': 0.058465286236297195,
  '4': 0.007308160779537149,
  '5': 0,
  '6': 0}}

# Next Onboarding step - 2nd

above we modeled the probabilities to go from nothing to the **first onboarding step**

Now we need to model the transitions inside the onboarding itself.

In [112]:
mask = (ob_steps.loc[:, 0] == 1)
x = ob_steps.loc[mask, 1].value_counts() / ob_steps.loc[mask, 1].shape[0]
x

 2    0.637037
-1    0.355556
 3    0.007407
Name: 1, dtype: float64

In [113]:
trans_probs['[1]'] = {'-1': x[-1],
                '1': 0,
                '2': x[2],
                '3': x[3],
                '4': 0,
                '5': 0,
                '6': 0}

In [114]:
mask = (ob_steps.loc[:, 0] == 2)
x = ob_steps.loc[mask, 1].value_counts() / ob_steps.loc[mask, 1].shape[0]
x

Series([], Name: 1, dtype: float64)

In [115]:
mask = (ob_steps.loc[:, 0] == 3)
x = ob_steps.loc[mask, 1].value_counts() / ob_steps.loc[mask, 1].shape[0]
x

-1    0.51250
 4    0.45625
 1    0.03125
Name: 1, dtype: float64

In [116]:
trans_probs['[3]'] = {'-1': x[-1],
                '1': x[1],
                '2': 0,
                '3': 0,
                '4': x[4],
                '5': 0,
                '6': 0}

In [117]:
mask = (ob_steps.loc[:, 0] == 4)
x = ob_steps.loc[mask, 1].value_counts() / ob_steps.loc[mask, 1].shape[0]
x

-1    0.692308
 3    0.246154
 1    0.046154
 5    0.015385
Name: 1, dtype: float64

In [118]:
trans_probs['[4]'] = {'-1': x[-1],
                '1': x[1],
                '2': 0,
                '3': x[3],
                '4': 0,
                '5': x[5],
                '6': 0}

In [119]:
mask = (ob_steps.loc[:, 0] == 5)
x = ob_steps.loc[mask, 1].value_counts() / ob_steps.loc[mask, 1].shape[0]
x

Series([], Name: 1, dtype: float64)

In [120]:
mask = (ob_steps.loc[:, 0] == 6)
x = ob_steps.loc[mask, 1].value_counts() / ob_steps.loc[mask, 1].shape[0]
x

Series([], Name: 1, dtype: float64)

# Next Onboarding step - 3rd

Above we did transition from first step -> second step.

Now we need to find the probabilities second step -> third step.

Permutations with **First step = 1**

In [121]:
mask = (ob_steps.loc[:, 0] == 1) & (ob_steps.loc[:, 1] == 2)
x = ob_steps.loc[mask, 2].value_counts() / ob_steps.loc[mask, 2].shape[0]
x

 3    0.360465
-1    0.343023
 4    0.296512
Name: 2, dtype: float64

In [122]:
trans_probs['[1, 2]'] = {'-1': x[-1],
                '1': 0,
                '2': 0,
                '3': x[3],
                '4': x[4],
                '5': 0,
                '6': 0}

In [123]:
mask = (ob_steps.loc[:, 0] == 1) & (ob_steps.loc[:, 1] == 3)
x = ob_steps.loc[mask, 2].value_counts() / ob_steps.loc[mask, 2].shape[0]
x

-1    0.5
 2    0.5
Name: 2, dtype: float64

In [124]:
trans_probs['[1, 3]'] = {'-1': x[-1],
                '1': 0,
                '2': x[2],
                '3': 0,
                '4': 0,
                '5': 0,
                '6': 0}

First step = 2 never occurs, so we move to **First step = 3**

In [131]:
mask = (ob_steps.loc[:, 0] == 3) & (ob_steps.loc[:, 1] == 1)
x = ob_steps.loc[mask, 2].value_counts() / ob_steps.loc[mask, 2].shape[0]
x

 2    0.9
-1    0.1
Name: 2, dtype: float64

In [132]:
trans_probs['[3, 1]'] = {'-1': x[-1],
                '1': 0,
                '2': x[2],
                '3': 0,
                '4': 0,
                '5': 0,
                '6': 0}

In [127]:
mask = (ob_steps.loc[:, 0] == 3) & (ob_steps.loc[:, 1] == 4)
x = ob_steps.loc[mask, 2].value_counts() / ob_steps.loc[mask, 2].shape[0]
x

-1    0.863014
 5    0.089041
 6    0.047945
Name: 2, dtype: float64

In [128]:
trans_probs['[3, 4]'] = {'-1': x[-1],
                '1': 0,
                '2': 0,
                '3': 0,
                '4': 0,
                '5': x[5],
                '6': x[6]}

And the last possible first step is **First step = 4**

In [139]:
mask = (ob_steps.loc[:, 0] == 4) & (ob_steps.loc[:, 1] == 1)
x = ob_steps.loc[mask, 2].value_counts() / ob_steps.loc[mask, 2].shape[0]
x

 2    0.666667
-1    0.333333
Name: 2, dtype: float64

In [140]:
trans_probs['[4, 1]'] = {'-1': x[-1],
                '1': 0,
                '2': x[2],
                '3': 0,
                '4': 0,
                '5': 0,
                '6': 0}

In [134]:
mask = (ob_steps.loc[:, 0] == 4) & (ob_steps.loc[:, 1] == 3)
x = ob_steps.loc[mask, 2].value_counts() / ob_steps.loc[mask, 2].shape[0]
x

-1    0.6875
 5    0.2500
 6    0.0625
Name: 2, dtype: float64

In [135]:
trans_probs['[4, 3]'] = {'-1': x[-1],
                '1': 0,
                '2': 0,
                '3': 0,
                '4': 0,
                '5': x[5],
                '6': x[6]}

In [141]:
mask = (ob_steps.loc[:, 0] == 4) & (ob_steps.loc[:, 1] == 5)
x = ob_steps.loc[mask, 2].value_counts() / ob_steps.loc[mask, 2].shape[0]
x

-1    1.0
Name: 2, dtype: float64

In [142]:
trans_probs['[4, 5]'] = {'-1': x[-1],
                '1': 0,
                '2': 0,
                '3': 0,
                '4': 0,
                '5': 0,
                '6': 0}

In [143]:
trans_probs

{'1st mesg': {'-1': 0.9330085261875761,
  '1': 0.001218026796589525,
  '2': 0,
  '3': 0.058465286236297195,
  '4': 0.007308160779537149,
  '5': 0,
  '6': 0},
 '[1]': {'-1': 0.35555555555555557,
  '1': 0,
  '2': 0.6370370370370371,
  '3': 0.007407407407407408,
  '4': 0,
  '5': 0,
  '6': 0},
 '[3]': {'-1': 0.5125,
  '1': 0.03125,
  '2': 0,
  '3': 0,
  '4': 0.45625,
  '5': 0,
  '6': 0},
 '[4]': {'-1': 0.6923076923076923,
  '1': 0.046153846153846156,
  '2': 0,
  '3': 0.24615384615384617,
  '4': 0,
  '5': 0.015384615384615385,
  '6': 0},
 '[1, 2]': {'-1': 0.3430232558139535,
  '1': 0,
  '2': 0,
  '3': 0.36046511627906974,
  '4': 0.29651162790697677,
  '5': 0,
  '6': 0},
 '[1, 3]': {'-1': 0.5, '1': 0, '2': 0.5, '3': 0, '4': 0, '5': 0, '6': 0},
 '[3, 1]': {'-1': 0.1, '1': 0, '2': 0.9, '3': 0, '4': 0, '5': 0, '6': 0},
 '[3, 4]': {'-1': 0.863013698630137,
  '1': 0,
  '2': 0,
  '3': 0,
  '4': 0,
  '5': 0.08904109589041095,
  '6': 0.04794520547945205},
 '[4, 1]': {'-1': 0.3333333333333333,
  '1':

# Next Onboarding step - 4th

Above we did transition from second step -> third step.

Now we need to find the probabilities third step -> fourth step.

There are quite a few combinations for the first three steps:
- 1 2 3
- 1 2 4
- 1 3 2
- 3 1 2
- 3 4 5
- 3 4 6
- 4 1 2
- 4 3 5
- 4 3 6

This is an additional sign that suggests to revise the internal procedure for the onboarding.

Anyway, let's patiently develop each of these combinations

In [144]:
mask = (ob_steps.loc[:, 0] == 1) & (ob_steps.loc[:, 1] == 2) & (ob_steps.loc[:, 2] == 3)
x = ob_steps.loc[mask, 3].value_counts() / ob_steps.loc[mask, 3].shape[0]
x

-1    0.580645
 4    0.387097
 5    0.032258
Name: 3, dtype: float64

In [145]:
trans_probs['[1, 2, 3]'] = {'-1': x[-1],
                '1': 0,
                '2': 0,
                '3': 0,
                '4': x[4],
                '5': x[5],
                '6': 0}

In [146]:
mask = (ob_steps.loc[:, 0] == 1) & (ob_steps.loc[:, 1] == 2) & (ob_steps.loc[:, 2] == 4)
x = ob_steps.loc[mask, 3].value_counts() / ob_steps.loc[mask, 3].shape[0]
x

 3    0.745098
-1    0.215686
 5    0.019608
 6    0.019608
Name: 3, dtype: float64

In [147]:
trans_probs['[1, 2, 4]'] = {'-1': x[-1],
                '1': 0,
                '2': 0,
                '3': x[3],
                '4': 0,
                '5': x[5],
                '6': x[6]}

In [148]:
mask = (ob_steps.loc[:, 0] == 1) & (ob_steps.loc[:, 1] == 3) & (ob_steps.loc[:, 2] == 2)
x = ob_steps.loc[mask, 3].value_counts() / ob_steps.loc[mask, 3].shape[0]
x

4    1.0
Name: 3, dtype: float64

In [149]:
trans_probs['[1, 3, 2]'] = {'-1': 0,
                '1': 0,
                '2': 0,
                '3': 0,
                '4': x[4],
                '5': 0,
                '6': 0}

In [151]:
mask = (ob_steps.loc[:, 0] == 3) & (ob_steps.loc[:, 1] == 1) & (ob_steps.loc[:, 2] == 2)
x = ob_steps.loc[mask, 3].value_counts() / ob_steps.loc[mask, 3].shape[0]
x

4    1.0
Name: 3, dtype: float64

In [154]:
trans_probs['[3, 1, 2]'] = {'-1': 0,
                '1': 0,
                '2': 0,
                '3': 0,
                '4': x[4],
                '5': 0,
                '6': 0}

In [155]:
mask = (ob_steps.loc[:, 0] == 3) & (ob_steps.loc[:, 1] == 4) & (ob_steps.loc[:, 2] == 5)
x = ob_steps.loc[mask, 3].value_counts() / ob_steps.loc[mask, 3].shape[0]
x

 6    0.538462
-1    0.461538
Name: 3, dtype: float64

In [157]:
trans_probs['[3, 4, 5]'] = {'-1': x[-1],
                '1': 0,
                '2': 0,
                '3': 0,
                '4': 0,
                '5': 0,
                '6': x[6]}

In [163]:
mask = (ob_steps.loc[:, 0] == 3) & (ob_steps.loc[:, 1] == 4) & (ob_steps.loc[:, 2] == 6)
x = ob_steps.loc[mask, 3].value_counts() / ob_steps.loc[mask, 3].shape[0]
x

 5    0.714286
-1    0.285714
Name: 3, dtype: float64

In [164]:
ob_steps.loc[mask, 3]

ID
285     5
287    -1
366     5
374     5
5617    5
6725    5
7875   -1
Name: 3, dtype: int64

Once again we observe some rare cases in which the Account manager was assigned after the subscription.

Let's remove it & manually fix the transition probabilities

In [168]:
data_to_use.loc[mask, 3] = -1

In [170]:
mask = (ob_steps.loc[:, 0] == 4) & (ob_steps.loc[:, 1] == 1) & (ob_steps.loc[:, 2] == 2)
x = ob_steps.loc[mask, 3].value_counts() / ob_steps.loc[mask, 3].shape[0]
x

3    1.0
Name: 3, dtype: float64

In [172]:
trans_probs['[4, 1, 2]'] = {'-1': 0,
                '1': 0,
                '2': 0,
                '3': x[3],
                '4': 0,
                '5': 0,
                '6': 0}

In [173]:
mask = (ob_steps.loc[:, 0] == 4) & (ob_steps.loc[:, 1] == 3) & (ob_steps.loc[:, 2] == 5)
x = ob_steps.loc[mask, 3].value_counts() / ob_steps.loc[mask, 3].shape[0]
x

-1    0.75
 6    0.25
Name: 3, dtype: float64

In [174]:
trans_probs['[4, 3, 5]'] = {'-1': x[-1],
                '1': 0,
                '2': 0,
                '3': 0,
                '4': 0,
                '5': 0,
                '6': x[6]}

In [175]:
mask = (ob_steps.loc[:, 0] == 4) & (ob_steps.loc[:, 1] == 3) & (ob_steps.loc[:, 2] == 6)
x = ob_steps.loc[mask, 3].value_counts() / ob_steps.loc[mask, 3].shape[0]
x

5    1.0
Name: 3, dtype: float64

Once again another case of AM assigned after subscription

In [176]:
ob_steps.loc[mask, 3]

ID
644    5
Name: 3, dtype: int64

In [177]:
data_to_use.loc[mask, 3] = -1

# Next Onboarding step - 5th

Above we did transition from third step -> fourth step.

Now we need to find the probabilities fourth step -> fifth step.

Once again, let's identify all the possible combinations and exclude the ones at the end of run

- 1 2 3 4
- 1 2 3 5
- 1 2 4 3
- 1 2 4 5
- 1 2 4 6 -> EOR
- 1 3 2 4
- 3 1 2 4
- 3 4 5 6 -> EOR
- 4 1 2 3
- 4 3 5 6 -> EOR

In [178]:
mask = (ob_steps.loc[:, 0] == 1) & (ob_steps.loc[:, 1] == 2) & (ob_steps.loc[:, 2] == 3) & (ob_steps.loc[:, 3] == 4)
x = ob_steps.loc[mask, 4].value_counts() / ob_steps.loc[mask, 4].shape[0]
x

 5    0.625
-1    0.250
 6    0.125
Name: 4, dtype: float64

In [179]:
trans_probs['[1, 2, 3, 4]'] = {'-1': x[-1],
                '1': 0,
                '2': 0,
                '3': 0,
                '4': 0,
                '5': x[5],
                '6': x[6]}

In [180]:
mask = (ob_steps.loc[:, 0] == 1) & (ob_steps.loc[:, 1] == 2) & (ob_steps.loc[:, 2] == 3) & (ob_steps.loc[:, 3] == 5)
x = ob_steps.loc[mask, 4].value_counts() / ob_steps.loc[mask, 4].shape[0]
x

6    1.0
Name: 4, dtype: float64

In [181]:
trans_probs['[1, 2, 3, 5]'] = {'-1': 0,
                '1': 0,
                '2': 0,
                '3': 0,
                '4': 0,
                '5': 0,
                '6': x[6]}

In [182]:
mask = (ob_steps.loc[:, 0] == 1) & (ob_steps.loc[:, 1] == 2) & (ob_steps.loc[:, 2] == 4) & (ob_steps.loc[:, 3] == 3)
x = ob_steps.loc[mask, 4].value_counts() / ob_steps.loc[mask, 4].shape[0]
x

-1    0.526316
 5    0.447368
 6    0.026316
Name: 4, dtype: float64

In [183]:
trans_probs['[1, 2, 4, 3]'] = {'-1': x[-1],
                '1': 0,
                '2': 0,
                '3': 0,
                '4': 0,
                '5': x[5],
                '6': x[6]}

In [184]:
mask = (ob_steps.loc[:, 0] == 1) & (ob_steps.loc[:, 1] == 2) & (ob_steps.loc[:, 2] == 4) & (ob_steps.loc[:, 3] == 5)
x = ob_steps.loc[mask, 4].value_counts() / ob_steps.loc[mask, 4].shape[0]
x

-1    1.0
Name: 4, dtype: float64

In [185]:
trans_probs['[1, 2, 4, 5]'] = {'-1': x[-1],
                '1': 0,
                '2': 0,
                '3': 0,
                '4': 0,
                '5': 0,
                '6': 0}

In [186]:
mask = (ob_steps.loc[:, 0] == 1) & (ob_steps.loc[:, 1] == 2) & (ob_steps.loc[:, 2] == 4) & (ob_steps.loc[:, 3] == 6)
x = ob_steps.loc[mask, 4].value_counts() / ob_steps.loc[mask, 4].shape[0]
x

3    1.0
Name: 4, dtype: float64

In [187]:
ob_steps.loc[mask]

Unnamed: 0_level_0,0,1,2,3,4,5
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
8070,1,2,4,6,3,-1


let's also fix this

In [188]:
data_to_use.loc[mask, 4] = -1

In [189]:
mask = (ob_steps.loc[:, 0] == 1) & (ob_steps.loc[:, 1] == 3) & (ob_steps.loc[:, 2] == 2) & (ob_steps.loc[:, 3] == 4)
x = ob_steps.loc[mask, 4].value_counts() / ob_steps.loc[mask, 4].shape[0]
x

6    1.0
Name: 4, dtype: float64

In [190]:
trans_probs['[1, 3, 2, 4]'] = {'-1': 0,
                '1': 0,
                '2': 0,
                '3': 0,
                '4': 0,
                '5': 0,
                '6': x[6]}

In [191]:
mask = (ob_steps.loc[:, 0] == 3) & (ob_steps.loc[:, 1] == 1) & (ob_steps.loc[:, 2] == 2) & (ob_steps.loc[:, 3] == 4)
x = ob_steps.loc[mask, 4].value_counts() / ob_steps.loc[mask, 4].shape[0]
x

 5    0.666667
-1    0.333333
Name: 4, dtype: float64

In [192]:
trans_probs['[3, 1, 2, 4]'] = {'-1': x[-1],
                '1': 0,
                '2': 0,
                '3': 0,
                '4': 0,
                '5': x[5],
                '6': 0}

In [193]:
mask = (ob_steps.loc[:, 0] == 3) & (ob_steps.loc[:, 1] == 4) & (ob_steps.loc[:, 2] == 5) & (ob_steps.loc[:, 3] == 6)
x = ob_steps.loc[mask, 4].value_counts() / ob_steps.loc[mask, 4].shape[0]
x

-1    1.0
Name: 4, dtype: float64

In [194]:
mask = (ob_steps.loc[:, 0] == 4) & (ob_steps.loc[:, 1] == 1) & (ob_steps.loc[:, 2] == 2) & (ob_steps.loc[:, 3] == 3)
x = ob_steps.loc[mask, 4].value_counts() / ob_steps.loc[mask, 4].shape[0]
x

 5    0.5
-1    0.5
Name: 4, dtype: float64

In [195]:
trans_probs['[4, 1, 2, 3]'] = {'-1': x[-1],
                '1': 0,
                '2': 0,
                '3': 0,
                '4': 0,
                '5': x[5],
                '6': 0}

In [196]:
mask = (ob_steps.loc[:, 0] == 4) & (ob_steps.loc[:, 1] == 3) & (ob_steps.loc[:, 2] == 5) & (ob_steps.loc[:, 3] == 6)
x = ob_steps.loc[mask, 4].value_counts() / ob_steps.loc[mask, 4].shape[0]
x

-1    1.0
Name: 4, dtype: float64

# Next Onboarding step - 6th

The sixth step is the **last one**.

Hence, the next state should be either 'Subscribed' or 'not subscribed' and end of run.

There are **four** combinations that made it so far:
- 1 2 3 4 5
- 1 2 4 3 5
- 3 1 2 4 5
- 4 1 2 3 5

let's finish them out

In [197]:
mask = (ob_steps.loc[:, 0] == 1) & (ob_steps.loc[:, 1] == 2) & (ob_steps.loc[:, 2] == 3) & (ob_steps.loc[:, 3] == 4) & (ob_steps.loc[:, 4] == 5)
x = ob_steps.loc[mask, 5].value_counts() / ob_steps.loc[mask, 5].shape[0]
x

 6    0.533333
-1    0.466667
Name: 5, dtype: float64

In [198]:
trans_probs['[1, 2, 3, 4, 5]'] = {'-1': x[-1],
                '1': 0,
                '2': 0,
                '3': 0,
                '4': 0,
                '5': 0,
                '6': x[6]}

In [199]:
mask = (ob_steps.loc[:, 0] == 1) & (ob_steps.loc[:, 1] == 2) & (ob_steps.loc[:, 2] == 4) & (ob_steps.loc[:, 3] == 3) & (ob_steps.loc[:, 4] == 5)
x = ob_steps.loc[mask, 5].value_counts() / ob_steps.loc[mask, 5].shape[0]
x

 6    0.764706
-1    0.235294
Name: 5, dtype: float64

In [200]:
trans_probs['[1, 2, 4, 3, 5]'] = {'-1': x[-1],
                '1': 0,
                '2': 0,
                '3': 0,
                '4': 0,
                '5': 0,
                '6': x[6]}

In [201]:
mask = (ob_steps.loc[:, 0] == 3) & (ob_steps.loc[:, 1] == 1) & (ob_steps.loc[:, 2] == 2) & (ob_steps.loc[:, 3] == 4) & (ob_steps.loc[:, 4] == 5)
x = ob_steps.loc[mask, 5].value_counts() / ob_steps.loc[mask, 5].shape[0]
x

 6    0.666667
-1    0.333333
Name: 5, dtype: float64

In [202]:
trans_probs['[3, 1, 2, 4, 5]'] = {'-1': x[-1],
                '1': 0,
                '2': 0,
                '3': 0,
                '4': 0,
                '5': 0,
                '6': x[6]}

In [203]:
mask = (ob_steps.loc[:, 0] == 4) & (ob_steps.loc[:, 1] == 1) & (ob_steps.loc[:, 2] == 2) & (ob_steps.loc[:, 3] == 3) & (ob_steps.loc[:, 4] == 5)
x = ob_steps.loc[mask, 5].value_counts() / ob_steps.loc[mask, 5].shape[0]
x

6    1.0
Name: 5, dtype: float64

In [204]:
trans_probs['[4, 1, 2, 3, 5]'] = {'-1': 0,
                '1': 0,
                '2': 0,
                '3': 0,
                '4': 0,
                '5': 0,
                '6': x[6]}

Now we should have **all the transition probabilities**

In [206]:
# We modified the data to use
with open('/content/gdrive/MyDrive/P7_files/data_to_use.pkl', 'wb') as handle:
  pickle.dump(data_to_use, handle)

with open('/content/gdrive/MyDrive/P7_files/trans_probs.pkl', 'wb') as handle:
  pickle.dump(trans_probs, handle)

# Rewards

In [None]:
data_to_use

Normally, in Reinforcement Learning the reward is based on the **state action pair**.

However, in our case, what really matters is the **state only**.

Hence, the reward - and the approximating Q function - will be based on state only.

Our primary goal is to obtain a **subscription**, so we define a reward for state 6

Our secondary goal is to **start the onboarding**, so we define also a reward for when column "OB started" changes from 0 to 1

Moreover, we **don't want** potential customers to interrupt the onboarding process.
Hence, we are defining a small negative reward if the onboarding is stopped (next state value -1)

Lastly, the data we have show max 3 contacts per potential customer.
In reality there could be more, but we don't have the data to model this. In any case, we don't want to penalize if after 3 calls the OB is not started yet - because it might still start with additional contacts.

Hence, if after the 3rd contact there is no OB, we assign a **null reward**

In [None]:
reward_sub = 10
reward_ob = 2     # we set it equal to 2 so that start and stop OB is still better (+1) than no OB (0)
stop_ob = -1

Note that if the customer subscribes, or interrupts the OB - also the **run ends**.

However, if a customer starts the OB, the run continues.

Since the reward is based only on state, we need to make sure to assign a reward for the start of the OB **only once**. We are going to use a flag in the DQN algorithm we are going to develop.