# Selection for States and Actions

## States

I chose to utilize the user data as states. In my initial attempt to create this RL model I attempted to utilize all user data as is (I combined them so it was one state per user). However, there were a few issues that this created. The first one I noticed was that it took way too long, I waited 3 whole hours and nothing was completed. Realizing that there were over 2 million data points I looked for a new solution. I concatonated each user so if we had 2 M-B-22-12 the would only be run once. This helped a little and I was able to get some good data coming out of the model, but a new issue was noticed, if I hadn't filled the q table with all possibilities of the states I could not querry the table with a "new" state. I could have handled this by just finding the closest state and selecting that, however I instead decided to combine age and tenure into groupings. Not only did this make it more querriable for any new state that can come in, but it also made it much easier to test changes in the data and see results. So in total the states I chose were based on user data in the format {gender}-{typed}-{age_group}-{tenure_group}.

## Actions

The actions were much simpler to determine. As we needed the RL model to determine the best email to send to get a response there was only one thing that could be considered action, the types of emails it can send out. Thus the actions are 1-3 as each type of email.


# Selection for Learning rate, the Discount factor, and Epsilon

For each of these items I chose them based on two things. The first iterations I chose based on what they do, this will be talked about in each section. The second way I attempted to choose the best number is through a series of loopings. The first looping chose these by looping through all possible values I could think for each of these categories. I then ran a second looping without any epochs just to see if I could determine a good setup.

alpha 0.5 gamma 0.99 epsilon 0.5
alpha 0.2 gamma 0.75 epsilon 0.8
alpha 0.2 gamma 0.95 epsilon 0.8
alpha 0.05 gamma 0.99 epsilon 0.5
alpha 0.05 gamma 0.95 epsilon 0.8
alpha 0.05 gamma 0.9 epsilon 0.2

Based on these results as well as my knowledge of these features I chose
alpha: 0.2
gamma: 0.6 (I felt that .9 was too high, also running the q_table generation .9 was very finicky compared to .6)
epsilon: 0.2

## Learning Rate (alpha):
As learning rate determines how quickly the agent updates its Q-value I started with a very moderate value of 0.1. This ensured that it wouldn't be too much, especially since the data had a lot of variability based on the groups I created.

## Discount Factor (gamma):
This affects how future rewards are valued. I felt the data may benefit from this so I chose something a little higher at 0.6. I was not as sure on this one as others so I started with the same value taht was utilized in pick and drop game.

## Epsilon:
Epsilon determines the balance between exploration and exploitation. I chose a relatively low one to start out with, however this may not have been as useful as I thought it may have been as later data showed higher epsilons seemed to do better.


In [85]:
import numpy as np
import pandas as pd
import pickle

class EmailCampaign:
    def __init__(self, training_data=None, actions=None, load_from=None):
        if load_from:
            self.load_model(load_from)
        else:
            self.training_data = training_data
            self.actions = actions.unique()  # Assuming actions are passed directly
            self.q_table = np.zeros((len(self.training_data['State'].unique()), len(self.actions)))
            # Creating mappings from state and action to index
            self.state_to_index = {state: idx for idx, state in enumerate(self.training_data['State'].unique())}
            self.action_to_index = {action: idx for idx, action in enumerate(self.actions)}
    
    def update_q_table(self, state, action, reward, alpha, gamma):
        state_idx = self.state_to_index[state]
        action_idx = self.action_to_index[action]
        current_q = self.q_table[state_idx, action_idx]
        next_max_q = np.max(self.q_table[state_idx])  # Assuming no next state is known
        self.q_table[state_idx, action_idx] = current_q + alpha * (reward + gamma * next_max_q - current_q)
        
    def convert_to_state(self, gender, typed, age, tenure):
        age_bins = [18, 25, 35, 50, 65]
        age_labels = ['18-24', '25-34', '35-49', '50-64']
        tenure_bins = [0, 5, 10, 15, 20, 40] 
        tenure_labels = ['0-4', '5-9', '10-14', '15-20', '21+']
        age_group_series = pd.cut([age], bins=age_bins, labels=age_labels, right=False)
        tenure_group_series = pd.cut([tenure], bins=tenure_bins, labels=tenure_labels, right=False)
        age_group = age_group_series[0]
        tenure_group = tenure_group_series[0]
        state = f"{gender}-{typed}-{age_group}-{tenure_group}"
        return state
        
    def save_model(self, filename_prefix):
        # Save the Q-table
        np.save(f'{filename_prefix}_qtable.npy', self.q_table)
        # Save the mappings
        with open(f'{filename_prefix}_state_to_index.pickle', 'wb') as handle:
            pickle.dump(self.state_to_index, handle, protocol=pickle.HIGHEST_PROTOCOL)
        with open(f'{filename_prefix}_action_to_index.pickle', 'wb') as handle:
            pickle.dump(self.action_to_index, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def load_model(self, filename_prefix):
        # Load the Q-table
        self.q_table = np.load(f'{filename_prefix}_qtable.npy')
        # Load the mappings
        with open(f'{filename_prefix}_state_to_index.pickle', 'rb') as handle:
            self.state_to_index = pickle.load(handle)
        with open(f'{filename_prefix}_action_to_index.pickle', 'rb') as handle:
            self.action_to_index = pickle.load(handle)
        self.actions = np.array(list(self.action_to_index.keys()))  # Reconstruct actions array
        
    def get_best_action_for_state(self, state_index):
        # Return the index of the action with the highest Q-value for the given state index
        state_idx = self.state_to_index[state_index]
        best_action_index = np.argmax(self.q_table[state_idx])
        return self.actions[best_action_index]

In [75]:
def reinforcement_solution(campaign, epochs=1000, epsilon=0.2, alpha=0.2, gamma=0.6):
    for epoch in range(epochs):
        shuffled_data = campaign.training_data.sample(frac=1).reset_index(drop=True)
        for _, row in shuffled_data.iterrows():
            state = row['State']
            action = row['SubjectLine_ID']
            sent_count = row['Sent_Count']
            responses_count = row['Responses_Count']
            reward = responses_count / sent_count if sent_count > 0 else 0

            if random.uniform(0, 1) < epsilon:
                action = np.random.choice(campaign.actions)  # Explore
            else:
                state_idx = campaign.state_to_index[state]
                action_idx = np.argmax(campaign.q_table[state_idx])  # Exploit
                action = campaign.actions[action_idx]

            campaign.update_q_table(state, action, reward, alpha, gamma)

In [76]:
from sklearn.model_selection import train_test_split

# Load your data
send_emails_df = pd.read_csv('data/sent_emails.csv')
responded_df = pd.read_csv('data/responded.csv')
userbase_df = pd.read_csv('data/userbase.csv')

# Define bins for Age and Tenure
age_bins = [18, 25, 35, 50, 65]
age_labels = ['18-24', '25-34', '35-49', '50-64']
tenure_bins = [0, 2, 5, 10, 20, 40]  # Adjusting to cover all up to 38 years included
tenure_labels = ['0-4', '5-9', '10-14', '15-20', '21+']

# Apply binning
userbase_df['Age_Group'] = pd.cut(userbase_df['Age'], bins=age_bins, labels=age_labels, right=False)
userbase_df['Tenure_Group'] = pd.cut(userbase_df['Tenure'], bins=tenure_bins, labels=tenure_labels, right=False)

# Update 'State' with binned age and tenure
userbase_df['State'] = userbase_df.apply(lambda x: f"{x['Gender']}-{x['Type']}-{x['Age_Group']}-{x['Tenure_Group']}", axis=1)

# Merge userbase_df to both send_emails_df and responded_df to include 'State' in these dataframes
send_emails_df = send_emails_df.merge(userbase_df[['Customer_ID', 'State']], on='Customer_ID', how='left')
responded_df = responded_df.merge(userbase_df[['Customer_ID', 'State']], on='Customer_ID', how='left')

# Count sent emails per state and subject line
sent_counts = send_emails_df.groupby(['State', 'SubjectLine_ID']).size().reset_index(name='Sent_Count')

# Count responses per state and subject line
response_counts = responded_df.groupby(['State', 'SubjectLine_ID']).size().reset_index(name='Responses_Count')

# Create a full dataset that includes all possible states and subject lines
full_data = userbase_df.drop_duplicates(subset=['State']).merge(send_emails_df[['SubjectLine_ID']].drop_duplicates(), how='cross')

# Merge the counts into the full data
full_data = full_data.merge(sent_counts, on=['State', 'SubjectLine_ID'], how='left')
full_data = full_data.merge(response_counts, on=['State', 'SubjectLine_ID'], how='left')

# Fill NA values for counts where there were no sends or responses
full_data['Sent_Count'].fillna(0, inplace=True)
full_data['Responses_Count'].fillna(0, inplace=True)

In [79]:
campaign = EmailCampaign(full_data, full_data['SubjectLine_ID'])
reinforcement_solution(campaign)
print(campaign.q_table)

[[0.34046401 0.33725394 0.33641605]
 [0.43738977 0.43951924 0.43664308]
 [0.40806808 0.4067911  0.40705028]
 [0.3447459  0.34428463 0.3484866 ]
 [0.33543606 0.34055155 0.3415619 ]
 [0.37554106 0.37767416 0.37403646]
 [0.41312502 0.40848877 0.40727588]
 [0.36545337 0.36664833 0.36121575]
 [0.40508768 0.40224395 0.39524291]
 [0.38672776 0.38219893 0.38657601]
 [0.32003124 0.32247336 0.32210735]
 [0.37019985 0.37107953 0.37023826]
 [0.3690349  0.3672385  0.3717771 ]
 [0.42120024 0.4165972  0.41286995]
 [0.50364323 0.52640683 0.51849794]
 [0.38003362 0.37813956 0.37680293]
 [0.43142116 0.43022354 0.43197506]
 [0.40885554 0.4112021  0.40873554]
 [0.49059089 0.49187807 0.49352727]
 [0.44088759 0.44655103 0.4451798 ]
 [0.38194932 0.38305061 0.38828974]
 [0.43935092 0.4404236  0.44277405]
 [0.38030523 0.38156685 0.38225193]
 [0.36519541 0.36058269 0.36018884]
 [0.43277493 0.43312799 0.4385015 ]
 [0.43041555 0.43641951 0.43301098]
 [0.31770591 0.31372439 0.3151784 ]
 [0.36626618 0.3609916  0.35

In [80]:
# Testing Cases For how the model may be performing
test_point = campaign.convert_to_state("F", "C", 21, 16) #Best outcome 3 Based on Responded csv
print(test_point)
print(campaign.get_best_action_for_state(test_point))

test_point2 = campaign.convert_to_state("M", "B", 36, 12) #Best outcome 1 Based on Responded csv
print(test_point2)
print(campaign.get_best_action_for_state(test_point2))

test_point3 = campaign.convert_to_state("M", "C", 24, 14) #Best outcome 1 Based on Responded csv
print(test_point3)
print(campaign.get_best_action_for_state(test_point3)) 

test_point4 = campaign.convert_to_state("F", "B", 38, 23) #Best outcome 3 Based on Responded csv
print(test_point4)
print(campaign.get_best_action_for_state(test_point4)) 

test_point5 = campaign.convert_to_state("F", "C", 22, 13) #Best outcome 2 Based on Responded csv
print(test_point5)
print(campaign.get_best_action_for_state(test_point5)) 

test_point6 = campaign.convert_to_state("F", "C", 24, 5) #Best outcome 2 Based on Responded csv
print(test_point6)
print(campaign.get_best_action_for_state(test_point6)) 

F-C-18-24-15-20
3
M-B-35-49-10-14
3
M-C-18-24-10-14
1
F-B-35-49-21+
3
F-C-18-24-10-14
2
F-C-18-24-5-9
2


In [82]:
campaign.save_model('email_campaign_model')

In [86]:
#Ensure Model Loads Correctly

campaign = EmailCampaign(load_from='email_campaign_model')

# Testing Cases For how the model may be performing
test_point = campaign.convert_to_state("F", "C", 21, 16) #Best outcome 3 Based on Responded csv
print(test_point)
print(campaign.get_best_action_for_state(test_point))

test_point2 = campaign.convert_to_state("M", "B", 36, 12) #Best outcome 1 Based on Responded csv
print(test_point2)
print(campaign.get_best_action_for_state(test_point2))

test_point3 = campaign.convert_to_state("M", "C", 24, 14) #Best outcome 1 Based on Responded csv
print(test_point3)
print(campaign.get_best_action_for_state(test_point3)) 

test_point4 = campaign.convert_to_state("F", "B", 38, 23) #Best outcome 3 Based on Responded csv
print(test_point4)
print(campaign.get_best_action_for_state(test_point4)) 

test_point5 = campaign.convert_to_state("F", "C", 22, 13) #Best outcome 2 Based on Responded csv
print(test_point5)
print(campaign.get_best_action_for_state(test_point5)) 

test_point6 = campaign.convert_to_state("F", "C", 24, 5) #Best outcome 2 Based on Responded csv
print(test_point6)
print(campaign.get_best_action_for_state(test_point6)) 

F-C-18-24-15-20
3
M-B-35-49-10-14
3
M-C-18-24-10-14
1
F-B-35-49-21+
3
F-C-18-24-10-14
2
F-C-18-24-5-9
2


In [64]:
import numpy as np

# Possible values for alpha, gamma, and epsilon
alphas = [0.01, 0.05, 0.1, 0.2, 0.5]
gammas = [0.5, 0.6, 0.75, 0.9, 0.95, 0.99]
epsilons = [0.8, 0.7, 0.5, 0.3, 0.2, 0.1]

best_results = []
best = [3, 1, 1, 3, 2, 2]

for alpha in alphas:
    for gamma in gammas:
        for epsilon in epsilons:
            # Reinitialize the campaign for each combination
            campaign = EmailCampaign(full_data, full_data['SubjectLine_ID'])
            
            # Train the model
            reinforcement_solution(campaign, epochs=1000, epsilon=epsilon, alpha=alpha, gamma=gamma)
            
            print('alpha', alpha, 'gamma', gamma, 'epsilon', epsilon)
            result = []
            # Testing Cases For how the model may be performing
            test_point = campaign.convert_to_state("F", "C", 21, 16) #Best outcome 3 Based on Responded csv
            result.append(campaign.get_best_action_for_state(test_point))

            test_point2 = campaign.convert_to_state("M", "B", 36, 12) #Best outcome 1 Based on Responded csv
            result.append(campaign.get_best_action_for_state(test_point2))

            test_point3 = campaign.convert_to_state("M", "C", 24, 14) #Best outcome 1 Based on Responded csv
            result.append(campaign.get_best_action_for_state(test_point3)) 

            test_point4 = campaign.convert_to_state("F", "B", 38, 23) #Best outcome 3 Based on Responded csv
            result.append(campaign.get_best_action_for_state(test_point4)) 

            test_point5 = campaign.convert_to_state("F", "C", 22, 13) #Best outcome 2 Based on Responded csv
            result.append(campaign.get_best_action_for_state(test_point5))

            test_point6 = campaign.convert_to_state("F", "C", 24, 5) #Best outcome 2 Based on Responded csv
            result.append(campaign.get_best_action_for_state(test_point6))  
            
            print(result)
            
            if(result == best):
                best_results.append[alpha, gamma, epsilon]
                
print(best_results)

alpha 0.01 gamma 0.5 epsilon 0.8
[2, 1, 2, 2, 1, 1]
alpha 0.01 gamma 0.5 epsilon 0.7
[2, 2, 3, 2, 2, 1]
alpha 0.01 gamma 0.5 epsilon 0.5
[2, 2, 3, 2, 2, 1]
alpha 0.01 gamma 0.5 epsilon 0.3
[1, 3, 3, 2, 2, 2]
alpha 0.01 gamma 0.5 epsilon 0.2
[2, 2, 2, 1, 2, 2]
alpha 0.01 gamma 0.5 epsilon 0.1
[2, 2, 2, 2, 2, 2]
alpha 0.01 gamma 0.6 epsilon 0.8
[3, 2, 1, 3, 1, 3]
alpha 0.01 gamma 0.6 epsilon 0.7
[1, 3, 1, 2, 1, 2]
alpha 0.01 gamma 0.6 epsilon 0.5
[1, 2, 1, 2, 1, 2]
alpha 0.01 gamma 0.6 epsilon 0.3
[3, 3, 3, 2, 2, 2]
alpha 0.01 gamma 0.6 epsilon 0.2
[1, 2, 2, 2, 2, 2]
alpha 0.01 gamma 0.6 epsilon 0.1
[2, 2, 2, 2, 2, 2]
alpha 0.01 gamma 0.75 epsilon 0.8
[3, 2, 3, 2, 2, 2]
alpha 0.01 gamma 0.75 epsilon 0.7
[1, 3, 2, 3, 1, 2]
alpha 0.01 gamma 0.75 epsilon 0.5
[2, 1, 2, 3, 2, 3]
alpha 0.01 gamma 0.75 epsilon 0.3
[2, 2, 2, 3, 1, 3]
alpha 0.01 gamma 0.75 epsilon 0.2
[3, 2, 2, 2, 2, 1]
alpha 0.01 gamma 0.75 epsilon 0.1
[2, 2, 2, 2, 2, 2]
alpha 0.01 gamma 0.9 epsilon 0.8
[3, 2, 2, 2, 2, 2]
alpha 

In [65]:
import numpy as np

# Possible values for alpha, gamma, and epsilon
alphas = [0.01, 0.05, 0.1, 0.2, 0.5]
gammas = [0.5, 0.6, 0.75, 0.9, 0.95, 0.99]
epsilons = [0.8, 0.7, 0.5, 0.3, 0.2, 0.1]

best_results = []
best = [3, 1, 1, 3, 2, 2]

for alpha in alphas:
    for gamma in gammas:
        for epsilon in epsilons:
            # Reinitialize the campaign for each combination
            campaign = EmailCampaign(full_data, full_data['SubjectLine_ID'])
            
            # Train the model
            reinforcement_solution(campaign, epochs=1, epsilon=epsilon, alpha=alpha, gamma=gamma)
            
            print('alpha', alpha, 'gamma', gamma, 'epsilon', epsilon)
            result = []
            # Testing Cases For how the model may be performing
            test_point = campaign.convert_to_state("F", "C", 21, 16) #Best outcome 3 Based on Responded csv
            result.append(campaign.get_best_action_for_state(test_point))

            test_point2 = campaign.convert_to_state("M", "B", 36, 12) #Best outcome 1 Based on Responded csv
            result.append(campaign.get_best_action_for_state(test_point2))

            test_point3 = campaign.convert_to_state("M", "C", 24, 14) #Best outcome 1 Based on Responded csv
            result.append(campaign.get_best_action_for_state(test_point3)) 

            test_point4 = campaign.convert_to_state("F", "B", 38, 23) #Best outcome 3 Based on Responded csv
            result.append(campaign.get_best_action_for_state(test_point4)) 

            test_point5 = campaign.convert_to_state("F", "C", 22, 13) #Best outcome 2 Based on Responded csv
            result.append(campaign.get_best_action_for_state(test_point5))

            test_point6 = campaign.convert_to_state("F", "C", 24, 5) #Best outcome 2 Based on Responded csv
            result.append(campaign.get_best_action_for_state(test_point6))  
            
            print(result)
            
            if(result == best):
                best_results.append[alpha, gamma, epsilon]
                
print(best_results)

alpha 0.01 gamma 0.5 epsilon 0.8
[2, 3, 3, 2, 1, 2]
alpha 0.01 gamma 0.5 epsilon 0.7
[3, 1, 2, 2, 2, 3]
alpha 0.01 gamma 0.5 epsilon 0.5
[2, 2, 1, 3, 3, 3]
alpha 0.01 gamma 0.5 epsilon 0.3
[2, 2, 2, 2, 2, 2]
alpha 0.01 gamma 0.5 epsilon 0.2
[2, 2, 1, 2, 3, 2]
alpha 0.01 gamma 0.5 epsilon 0.1
[2, 2, 2, 2, 2, 2]
alpha 0.01 gamma 0.6 epsilon 0.8
[2, 1, 3, 1, 1, 2]
alpha 0.01 gamma 0.6 epsilon 0.7
[1, 1, 1, 2, 1, 1]
alpha 0.01 gamma 0.6 epsilon 0.5
[2, 1, 2, 3, 2, 1]
alpha 0.01 gamma 0.6 epsilon 0.3
[2, 2, 2, 2, 2, 3]
alpha 0.01 gamma 0.6 epsilon 0.2
[2, 2, 2, 2, 1, 2]
alpha 0.01 gamma 0.6 epsilon 0.1
[2, 2, 1, 2, 2, 2]
alpha 0.01 gamma 0.75 epsilon 0.8
[1, 2, 2, 2, 1, 2]
alpha 0.01 gamma 0.75 epsilon 0.7
[2, 2, 2, 1, 2, 2]
alpha 0.01 gamma 0.75 epsilon 0.5
[2, 1, 2, 2, 2, 1]
alpha 0.01 gamma 0.75 epsilon 0.3
[3, 2, 2, 2, 3, 2]
alpha 0.01 gamma 0.75 epsilon 0.2
[2, 2, 2, 2, 2, 2]
alpha 0.01 gamma 0.75 epsilon 0.1
[2, 2, 2, 2, 2, 2]
alpha 0.01 gamma 0.9 epsilon 0.8
[3, 2, 2, 3, 1, 2]
alpha 