# Using RL to plan meals within a budget while catering to personal preferences

## Aim

When shopping for groceries, there are often multiple options for the same ingredient: some cheaper, some of higher quality. The goal of this project is to develop a model that can select the optimal combination of products for a meal that:

- Stays within a specified budget
- Aligns with personal preferences

The motivation for using a model is scalability: as the number of ingredients and product options grows, the number of possible combinations quickly surpasses what can reasonably be calculated mentally. A model allows us to handle larger and more realistic scenarios efficiently.


## Method

To tackle this problem, a simple RL approach is used, specifically Monte Carlo learning, to identify the optimal combination of products for a meal.

The model can be framed as a Markov Decision Process (MDP):

- Each required ingredient is considered a State.
- Each available product for an ingredient represents an Action for that state.
- Preferences for products are treated as Individual Rewards, which are detailed later.
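
As a rough illustration of this framing, the sketch below lays out a tiny, made-up product catalogue as states and actions. The ingredient names, product names and prices are hypothetical and are not taken from the sample data.

```python
# Hypothetical catalogue: each ingredient (state) maps to its available products (actions).
catalogue = {
    "pasta":  {"value_pasta": 0.50, "premium_pasta": 1.80},
    "sauce":  {"value_sauce": 0.90, "premium_sauce": 2.50},
    "cheese": {"value_cheese": 1.50, "premium_cheese": 3.20},
}

states = list(catalogue)                            # one state per required ingredient
actions = {s: list(catalogue[s]) for s in states}   # products available in each state

# An episode picks exactly one action per state, so the number of distinct
# meals is the product of the number of options for each ingredient.
n_meals = 1
for s in states:
    n_meals *= len(actions[s])

print(states)
print(n_meals)
```

Even this toy catalogue shows how the number of possible meals multiplies with every extra ingredient or product.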

Monte Carlo learning evaluates the overall quality of each sequence of actions by observing the outcome of the complete combination. Unlike approaches that update step by step, Monte Carlo waits until the full sequence is known before assessing the quality of each choice.
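
The update applied at the end of an episode can be sketched as follows. This is a minimal illustration, assuming the same rule used in the model code further down: each chosen product's value `V` moves a fraction `alpha` of the way toward the episode return shared equally across the ingredients. The function name `mc_update` and the numbers are illustrative only.

```python
def mc_update(v, episode_return, n_ingredients, alpha):
    """Move v a fraction alpha toward the per-ingredient share of the return."""
    target = episode_return / n_ingredients
    return v + alpha * (target - v)

# One product chosen in five consecutive episodes, each ending within budget (return +1).
v = 0.0
history = []
for _ in range(5):
    v = mc_update(v, episode_return=1, n_ingredients=4, alpha=0.5)
    history.append(v)

print(history)  # the values climb toward the target 1 / 4 = 0.25
```

A larger `alpha` closes the gap to the target faster but also makes `V` react more strongly to a single unlucky episode.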

While Monte Carlo is often avoided because of the time it takes to simulate full sequences, it is particularly well suited here: the final evaluation of a meal combination depends on whether the total cost of the selected products stays within the budget, which is only known once every product has been chosen.
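
To make the terminal evaluation concrete, the following sketch scores a completed selection only after all product costs are known. The helper name `terminal_return` and the prices are assumptions for illustration; the +1/-1 values mirror the first model version below.

```python
def terminal_return(costs, budget):
    """Return +1 if the completed selection fits the budget, otherwise -1."""
    return 1 if sum(costs) <= budget else -1

# The first three products are identical; the last choice decides the outcome.
within = terminal_return([10, 8, 5, 6], budget=30)   # total 29
over = terminal_return([10, 8, 5, 9], budget=30)     # total 32
print(within, over)
```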


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time

In [None]:
data = pd.read_csv("SampleData.csv")

In [4]:
data
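
`SampleData.csv` is not reproduced here. For readers who want to run the cells without it, the frame below is a stand-in with the columns the code relies on (`Ingredient`, `Product`, `QMerged_label`, `Real_Cost`, `V_0`). The nine rows, the split of products across ingredients, and all costs are invented for illustration; only the label convention `Ingredient * 10 + Product` is taken from the model code. The costs are chosen so that rows 1, 2, 4 and 8 hold the cheapest product of each ingredient, which is an assumption about what the hard-coded row indices in the model refer to.

```python
import pandas as pd

# Invented stand-in for SampleData.csv: 4 ingredients, 9 products in total.
sample_data = pd.DataFrame({
    "Ingredient": [1, 1, 2, 2, 3, 3, 4, 4, 4],
    "Product":    [1, 2, 1, 2, 1, 2, 1, 2, 3],
    "Real_Cost":  [10, 6, 5, 8, 4, 7, 9, 8, 5],   # made-up prices
})
# Label convention used by the model code: Ingredient * 10 + Product.
sample_data["QMerged_label"] = sample_data["Ingredient"] * 10 + sample_data["Product"]
sample_data["V_0"] = 0.0  # initial value estimate for every product

# Column order matters for cells that index columns by position.
sample_data = sample_data[["Ingredient", "Product", "QMerged_label", "Real_Cost", "V_0"]]

# Rows holding the cheapest product of each ingredient.
cheapest_rows = sample_data.groupby("Ingredient")["Real_Cost"].idxmin().tolist()
print(cheapest_rows)
```

Passing `sample_data` in place of `data` should let the later cells run, although the resulting plots will naturally differ from those produced with the real file.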

In [None]:
def MCModelv1(data, alpha, e, epsilon, budget, reward):
    Ingredients = list(set(data['Ingredient']))
    data = data.copy()  # avoid modifying original
    data['V'] = data['V_0'].copy()

    output = []
    output1 = []
    output2 = []
    action_in_full = []

    for episode_idx in range(e):
        episode_run = []

        for i, ingredient in enumerate(Ingredients):
            # epsilon-greedy selection
            if episode_idx == 0 or np.random.rand() < epsilon:
                # Random selection
                possible_products = data.loc[data['Ingredient'] == ingredient, 'Product']
                chosen_product = np.random.choice(possible_products)
            else:
                # Greedy selection based on max V
                possible_products = data.loc[data['Ingredient'] == ingredient]
                max_v = possible_products['V'].max()
                chosen_product = possible_products[possible_products['V'] == max_v]['Product'].iloc[0]

            episode_run.append(int(chosen_product))

        episode_df = pd.DataFrame({
            'Ingredient': Ingredients,
            'Product': episode_run
        })
        episode_df['Merged_label'] = (episode_df['Ingredient']*10 + episode_df['Product']).astype(float)

        # Merge with cost and reward
        episode_df = episode_df.merge(
            data[['QMerged_label', 'Real_Cost']], 
            left_on='Merged_label', right_on='QMerged_label',
            how='inner'
        )
        # Terminal reward based on budget
        episode_df['Return'] = 1 if budget >= episode_df['Real_Cost'].sum() else -1

        # Update V values
        data = data.merge(
            episode_df[['Merged_label', 'Return']],
            left_on='QMerged_label', right_on='Merged_label',
            how='left'
        )
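        # Update only the products chosen this episode; unchosen products keep their V
        chosen = data['Return'].notna()
        data['Return'] = data['Return'].fillna(0)
        data['V'] = np.where(
            chosen,
            data['V'] + alpha * ((data['Return'] / len(Ingredients)) - data['V']),
            data['V']
        )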

        # Remove temporary columns
        data = data.drop(columns=['Merged_label', 'Return'], errors='ignore')

        # Store outputs
        output.append(data['V'].sum())
        output1.append(data.iloc[[1,2,4,8]]['V'].sum())
        output2.append(data.iloc[[0,3,5,6,7]]['V'].sum())

        # Optimal action per ingredient: the product with the highest V,
        # chosen at random when several products tie for the maximum
        best_v = data.groupby('Ingredient')['V'].transform('max')
        optimal_actions = data[data['V'] == best_v].groupby('Ingredient')['Product'].apply(
            lambda x: x.sample(1).iloc[0]
        )
        action_in_full.extend(optimal_actions.astype(int))

    return (
        np.array(output),
        np.array(output1),
        np.array(output2),
        optimal_actions,
        data,
        np.array(action_in_full)
    )
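
The selection step inside the loop is an epsilon-greedy rule. A stripped-down, standard-library version is sketched below, assuming a plain dictionary of value estimates; `choose_product` and the values are hypothetical and only meant to isolate the idea from the DataFrame handling.

```python
import random

def choose_product(values, epsilon, rng):
    """Pick a random product with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.choice(list(values))   # explore
    return max(values, key=values.get)    # exploit (first maximum wins on ties)

values = {1: 0.10, 2: 0.25, 3: -0.05}     # made-up V estimates for one ingredient
rng = random.Random(0)

greedy_pick = choose_product(values, epsilon=0.0, rng=rng)   # never explores
picks = [choose_product(values, epsilon=0.5, rng=rng) for _ in range(1000)]
share_best = picks.count(2) / len(picks)
print(greedy_pick, share_best)
```

With `epsilon=0.5` and three products, the best product is expected in roughly two thirds of the picks: half of the time by exploitation, plus a third of the exploratory half.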


In [None]:
# Define parameters
alpha = 0.1
num_episodes = 100
epsilon = 0.5
budget = 30

# Reward vector (currently all zeros)
reward = [0] * 9

# Measure execution time
start_time = time.time()

# Run Monte Carlo model
Mdl = MCModelv1(
    data=data,
    alpha=alpha,
    e=num_episodes,
    epsilon=epsilon,
    budget=budget,
    reward=reward
)
# Print execution time
print(f"--- {time.time() - start_time:.2f} seconds ---")

In [6]:
print(Mdl[3])

Mdl[4]
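
With only a handful of products, the learned selection can be cross-checked by enumerating every combination directly. The sketch below does this for an invented catalogue (the costs are not from `SampleData.csv`); it is a sanity check rather than part of the model, and it stops being practical as the number of ingredients and products grows.

```python
from itertools import product

# Invented costs: ingredient -> {product number: cost}.
costs = {
    1: {1: 10, 2: 6},
    2: {1: 5, 2: 8},
    3: {1: 4, 2: 7},
    4: {1: 9, 2: 8, 3: 5},
}
check_budget = 23

ingredients = sorted(costs)
combos = list(product(*(costs[i].items() for i in ingredients)))

# Keep the combinations whose total cost fits the budget.
feasible = []
for combo in combos:
    total = sum(cost for _, cost in combo)
    if total <= check_budget:
        feasible.append((tuple(p for p, _ in combo), total))

cheapest = min(feasible, key=lambda item: item[1])
print(len(combos), len(feasible), cheapest)
```

Each entry of `feasible` pairs the chosen product numbers (in ingredient order) with their total cost, which makes it easy to compare against the model's selection.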


In [7]:
plt.plot(range(0,num_episodes), Mdl[0])
plt.title('Sum of V for all Actions at each Episode')
plt.xlabel('Episode')
plt.ylabel('Sum of V')
plt.show()

In [8]:
plt.plot(range(0,num_episodes), Mdl[1],range(0,num_episodes), Mdl[2])
plt.title('Sum of V for the cheapest actions and others separated at each Episode')
plt.xlabel('Episode')
plt.ylabel('Sum of V')
plt.show()


In [None]:
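# Get unique ingredients in sorted order, matching the per-ingredient (groupby) order
# in which the model stores its selected actions
Ingredients = sorted(set(data['Ingredient']))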
actions = pd.DataFrame()

# Iterate over each ingredient
for a, ingredient in enumerate(Ingredients):
    # Extract actions selected for this ingredient across all episodes
    individual_actions = Mdl[5][a::len(Ingredients)]  
    actions[a] = individual_actions
    
    # Plot the product selection over episodes
    plt.figure(figsize=(8, 4))
    plt.plot(range(num_episodes), actions[a], marker='o', linestyle='-', markersize=4)
    plt.title(f'Product Selection for Ingredient {ingredient}')
    plt.xlabel('Episode')
    plt.ylabel('Selected Product')
    plt.grid(True)
    plt.show()


In [None]:
# Create adjusted product columns
actions2 = actions.copy()
actions2['Product1'] = actions2.iloc[:, 0] + 10
actions2['Product2'] = actions2.iloc[:, 1] + 20
actions2['Product3'] = actions2.iloc[:, 2] + 30
actions2['Product4'] = actions2.iloc[:, 3] + 40

# Merge with real costs, one ingredient at a time; the join key is dropped after
# each merge so repeated merges do not produce clashing QMerged_label columns
for k in range(1, 5):
    actions2 = actions2.merge(
        data[['QMerged_label', 'Real_Cost']],
        left_on=f'Product{k}', right_on='QMerged_label', how='left'
    ).rename(columns={'Real_Cost': f'Cost{k}'}).drop(columns='QMerged_label')

# Calculate total cost
actions2['Total_Cost'] = actions2[['Cost1', 'Cost2', 'Cost3', 'Cost4']].sum(axis=1)

# Keep only relevant columns and trim to num_episodes
actions2 = actions2.iloc[:num_episodes, [0, 1, 2, 3, -1]]

# Plot total cost vs budget
plt.figure(figsize=(10, 5))
plt.plot(range(num_episodes), actions2['Total_Cost'], label='Total Cost')
plt.axhline(y=budget, color='k', linestyle='--', linewidth=2, label='Budget')
plt.title('Total Real Cost of Selected Products per Episode')
plt.xlabel('Episode')
plt.ylabel('Total Real Cost (£)')
plt.ylim([0, budget + 10])
plt.grid(True)
plt.legend()
plt.show()


In [None]:
# Parameters for very small budget
budget2 = 5
alpha2 = 0.1
num_episodes2 = 100
epsilon2 = 0.5
reward2 = [0] * 9  # Currently all zeros

# Run Monte Carlo model
start_time = time.time()
Mdl2 = MCModelv1(
    data=data,
    alpha=alpha2,
    e=num_episodes2,
    epsilon=epsilon2,
    budget=budget2,
    reward=reward2
)
print(f"--- {time.time() - start_time:.2f} seconds ---")

# Plot sum of V for all actions
plt.figure(figsize=(10, 5))
plt.plot(range(num_episodes2), Mdl2[0], label='Sum of V (All Actions)')
plt.title('Sum of V for All Actions per Episode (Very Small Budget)')
plt.xlabel('Episode')
plt.ylabel('Sum of V')
plt.grid(True)
plt.legend()
plt.show()


In [None]:
# Set parameters
budget3 = 100  # Very large budget
alpha3 = 0.1
num_episodes3 = 100
epsilon3 = 0.5
reward3 = [0] * 9  # Currently all zeros

# Run Monte Carlo model
start_time = time.time()
Mdl3 = MCModelv1(
    data=data,
    alpha=alpha3,
    e=num_episodes3,
    epsilon=epsilon3,
    budget=budget3,
    reward=reward3
)
print(f"--- {time.time() - start_time:.2f} seconds ---")

# Plot sum of V for all actions
plt.figure(figsize=(10, 5))
plt.plot(range(num_episodes3), Mdl3[0], label='Sum of V (All Actions)')
plt.title('Sum of V for All Actions per Episode (Large Budget)')
plt.xlabel('Episode')
plt.ylabel('Sum of V')
plt.grid(True)
plt.legend()
plt.show()

In [None]:
# Set parameters
budget4 = 23
alpha4 = 0.1
num_episodes4 = 100
epsilon4 = 0.5
reward4 = [0] * 9  # Currently all zeros

# Run the Monte Carlo model
start_time = time.time()
Mdl4 = MCModelv1(
    data=data,
    alpha=alpha4,
    e=num_episodes4,
    epsilon=epsilon4,
    budget=budget4,
    reward=reward4
)
print(f"--- {time.time() - start_time:.2f} seconds ---")

# Plot sum of V for all actions
plt.figure(figsize=(10, 5))
plt.plot(range(num_episodes4), Mdl4[0], label='Sum of V (All Actions)')
plt.title('Sum of V for All Actions per Episode')
plt.xlabel('Episode')
plt.ylabel('Sum of V')
plt.grid(True)
plt.legend()
plt.show()

# Plot sum of V for cheapest actions vs others
plt.figure(figsize=(10, 5))
plt.plot(range(num_episodes4), Mdl4[1], label='Cheapest Actions')
plt.plot(range(num_episodes4), Mdl4[2], label='Other Actions')
plt.title('Sum of V for Cheapest Actions vs Others')
plt.xlabel('Episode')
plt.ylabel('Sum of V')
plt.grid(True)
plt.legend()
plt.show()

In [None]:
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

# display Plotly version info
import plotly
print(f"Plotly version: {plotly.__version__}")


In [None]:
# Fixed parameters except alpha
budget5 = 23
num_episodes5 = 100
epsilon5 = 0.5
reward5 = [0] * 9  # Currently all zeros

# Prepare lists to store results
V_list = []
alpha_list = []
episode_list = []

# Loop over alpha values
for x in range(10):
    alpha5 = 1 - x / 10
    Mdl5 = MCModelv1(
        data=data,
        alpha=alpha5,
        e=num_episodes5,
        epsilon=epsilon5,
        budget=budget5,
        reward=reward5
    )
    # Extend the lists with episode results
    V_list.extend(Mdl5[0])
    alpha_list.extend([alpha5] * num_episodes5)
    episode_list.extend(range(num_episodes5))

# Create DataFrame
VforInteractiveGraphA2 = pd.DataFrame({
    'Alpha': alpha_list,
    'Episode': episode_list,
    'V': V_list
})

VforInteractiveGraphA2.head()


In [None]:
# Copy data from previous DataFrame
VforInteractiveGraphA3 = VforInteractiveGraphA2.copy()

# Add extra columns for Plotly visualization
VforInteractiveGraphA3['continent'] = 'Test'
VforInteractiveGraphA3['country'] = 'Test2'
VforInteractiveGraphA3['pop'] = 7000000.0

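# Rename to the gapminder-style column names used by the animation code below:
# year = alpha, lifeExp = episode, gdpPercap = sum of V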
VforInteractiveGraphA3.columns = ['year', 'lifeExp', 'gdpPercap', 'continent', 'country', 'pop']

# Get sorted unique alpha values (years) for slider frames
years = np.sort(VforInteractiveGraphA3['year'].unique())[::-1]
years = np.round(years, 1)

years


In [None]:
# Use VforInteractiveGraphA3 as dataset
dataset = VforInteractiveGraphA3

continents = dataset['continent'].unique().tolist()
years = np.sort(dataset['year'].unique())[::-1]

# Initialize Plotly figure
figure = {
    'data': [],
    'layout': {
        'title': "Parameter Optimisation using Interactive Animation",
        'xaxis': {'title': 'Episode'},
        'yaxis': {'title': 'Sum of V', 'type': 'linear'},
        'hovermode': 'closest',
        'sliders': {},
        'updatemenus': [{
            'buttons': [
                {'args': [None, {'frame': {'duration': 500, 'redraw': False},
                                 'fromcurrent': True,
                                 'transition': {'duration': 300, 'easing': 'quadratic-in-out'}}],
                 'label': 'Play', 'method': 'animate'},
                {'args': [[None], {'frame': {'duration': 0, 'redraw': False},
                                   'mode': 'immediate',
                                   'transition': {'duration': 0}}],
                 'label': 'Pause', 'method': 'animate'}
            ],
            'direction': 'left', 'pad': {'r': 10, 't': 87}, 'showactive': False,
            'type': 'buttons', 'x': 0.1, 'xanchor': 'right', 'y': 0, 'yanchor': 'top'
        }]
    },
    'frames': []
}

# Slider dictionary
sliders_dict = {
    'active': 0,
    'yanchor': 'top',
    'xanchor': 'left',
    'currentvalue': {'font': {'size': 20}, 'prefix': 'Alpha: ', 'visible': True, 'xanchor': 'right'},
    'transition': {'duration': 300, 'easing': 'cubic-in-out'},
    'pad': {'b': 10, 't': 50},
    'len': 0.9,
    'x': 0.1,
    'y': 0,
    'steps': []
}

# Initial frame (first year/alpha)
for continent in continents:
    data_by_cont = dataset[dataset['continent'] == continent]
    figure['data'].append({
        'x': list(data_by_cont['lifeExp']),
        'y': list(data_by_cont['gdpPercap']),
        'mode': 'markers',
        'text': list(data_by_cont['country']),
        'marker': {'sizemode': 'area', 'sizeref': 200000, 'size': list(data_by_cont['pop'])},
        'name': continent
    })

# Create frames and slider steps
for year in years:
    frame = {'data': [], 'name': str(year)}
    for continent in continents:
        data_by_cont = dataset[(dataset['continent'] == continent) & (np.round(dataset['year'], 1) == np.round(year, 1))]
        frame['data'].append({
            'x': list(data_by_cont['lifeExp']),
            'y': list(data_by_cont['gdpPercap']),
            'mode': 'markers',
            'text': list(data_by_cont['country']),
            'marker': {'sizemode': 'area', 'sizeref': 200000, 'size': list(data_by_cont['pop']), 'color': 'rgba(255,182,193,0.9)'},
            'name': continent
        })
    figure['frames'].append(frame)
    sliders_dict['steps'].append({
        'args': [[year], {'frame': {'duration': 300, 'redraw': False}, 'mode': 'immediate', 'transition': {'duration': 300}}],
        'label': f'{year:.1f}',  # rounded so float noise does not show on the slider
        'method': 'animate'
    })

figure['layout']['sliders'] = [sliders_dict]

# Display interactive animation
from plotly.offline import iplot
iplot(figure)


In [None]:
# Fixed parameters except alpha
budget6 = 23
num_episodes6 = 100
epsilon6 = 0.5
reward6 = [0] * 9  # Rewards for actions

# Prepare lists for interactive animation
V_list, alpha_list, episode_list = [], [], []

for x in range(10):
    alpha6 = 0.1 - x / 100
    Mdl6 = MCModelv1(
        data=data,
        alpha=alpha6,
        e=num_episodes6,
        epsilon=epsilon6,
        budget=budget6,
        reward=reward6
    )
    
    V_list.extend(Mdl6[0])
    alpha_list.extend([alpha6] * num_episodes6)
    episode_list.extend(range(num_episodes6))

# Create DataFrame for Plotly animation
df_plot = pd.DataFrame({
    'year': alpha_list,      # Alpha values
    'lifeExp': episode_list, # Episode numbers
    'gdpPercap': V_list      # Sum of V
})

# Add bubble simulation columns
df_plot['continent'] = 'Test'
df_plot['country'] = 'Test2'
df_plot['pop'] = 7_000_000.0

# Unique alpha values for slider frames
alpha_values = np.sort(df_plot['year'].unique())[::-1]

# Determine continents
continents = df_plot['continent'].unique().tolist()

# Initialize Plotly figure
figure = {
    'data': [],
    'layout': {
        'title': "Parameter Optimisation using Interactive Animation",
        'xaxis': {'title': 'Episode'},
        'yaxis': {'title': 'Sum of V', 'type': 'linear'},
        'hovermode': 'closest',
        'sliders': {},
        'updatemenus': [{
            'buttons': [
                {'args': [None, {'frame': {'duration': 500, 'redraw': False},
                                 'fromcurrent': True,
                                 'transition': {'duration': 300, 'easing': 'quadratic-in-out'}}],
                 'label': 'Play', 'method': 'animate'},
                {'args': [[None], {'frame': {'duration': 0, 'redraw': False},
                                   'mode': 'immediate',
                                   'transition': {'duration': 0}}],
                 'label': 'Pause', 'method': 'animate'}
            ],
            'direction': 'left', 'pad': {'r': 10, 't': 87}, 'showactive': False,
            'type': 'buttons', 'x': 0.1, 'xanchor': 'right', 'y': 0, 'yanchor': 'top'
        }]
    },
    'frames': []
}

# Slider dictionary
sliders_dict = {
    'active': 0, 'yanchor': 'top', 'xanchor': 'left',
    'currentvalue': {'font': {'size': 20}, 'prefix': 'Alpha: ', 'visible': True, 'xanchor': 'right'},
    'transition': {'duration': 300, 'easing': 'cubic-in-out'},
    'pad': {'b': 10, 't': 50}, 'len': 0.9, 'x': 0.1, 'y': 0, 'steps': []
}

# Initial data for the first frame
for continent in continents:
    data_by_cont = df_plot[df_plot['continent'] == continent]
    figure['data'].append({
        'x': list(data_by_cont['lifeExp']),
        'y': list(data_by_cont['gdpPercap']),
        'mode': 'markers',
        'text': list(data_by_cont['country']),
        'marker': {'sizemode': 'area', 'sizeref': 200_000, 'size': list(data_by_cont['pop'])},
        'name': continent
    })

# Create frames and slider steps
for alpha in alpha_values:
    frame = {'data': [], 'name': str(alpha)}
    for continent in continents:
        data_by_cont = df_plot[(df_plot['continent'] == continent) & (np.round(df_plot['year'], 2) == np.round(alpha, 2))]
        frame['data'].append({
            'x': list(data_by_cont['lifeExp']),
            'y': list(data_by_cont['gdpPercap']),
            'mode': 'markers',
            'text': list(data_by_cont['country']),
            'marker': {'sizemode': 'area', 'sizeref': 200_000, 'size': list(data_by_cont['pop']), 'color': 'rgba(255,182,193,0.9)'},
            'name': continent
        })
    figure['frames'].append(frame)
    sliders_dict['steps'].append({
        'args': [[alpha], {'frame': {'duration': 300, 'redraw': False}, 'mode': 'immediate', 'transition': {'duration': 300}}],
        'label': f'{alpha:.2f}',  # rounded so float noise does not show on the slider
        'method': 'animate'
    })

figure['layout']['sliders'] = [sliders_dict]

iplot(figure)


In [None]:
# Fixed parameters except epsilon
budget7 = 23
num_episodes7 = 100
alpha7 = 0.05
reward7 = [0] * 9  # Rewards for actions

# Prepare lists for interactive animation
V_list, epsilon_list, episode_list = [], [], []

for x in range(11):
    epsilon7 = 1 - x / 10
    Mdl7 = MCModelv1(
        data=data,
        alpha=alpha7,
        e=num_episodes7,
        epsilon=epsilon7,
        budget=budget7,
        reward=reward7
    )
    
    # Vectorized append for all episodes
    V_list.extend(Mdl7[0])
    epsilon_list.extend([epsilon7] * num_episodes7)
    episode_list.extend(range(num_episodes7))

# Create DataFrame for Plotly animation
df_plot = pd.DataFrame({
    'year': epsilon_list,      # Epsilon values
    'lifeExp': episode_list,   # Episode number
    'gdpPercap': V_list        # Sum of V
})

# Add bubble simulation columns
df_plot['continent'] = 'Test'
df_plot['country'] = 'Test2'
df_plot['pop'] = 7_000_000.0

# Unique epsilon values for slider frames
epsilon_values = np.sort(df_plot['year'].unique())[::-1]

# Determine continents
continents = df_plot['continent'].unique().tolist()

# Initialize Plotly figure structure
figure = {
    'data': [],
    'layout': {
        'title': "Parameter Optimisation using Interactive Animation",
        'xaxis': {'title': 'Episode'},
        'yaxis': {'title': 'Sum of V', 'type': 'linear'},
        'hovermode': 'closest',
        'sliders': {},
        'updatemenus': [{
            'buttons': [
                {'args': [None, {'frame': {'duration': 500, 'redraw': False},
                                 'fromcurrent': True,
                                 'transition': {'duration': 300, 'easing': 'quadratic-in-out'}}],
                 'label': 'Play', 'method': 'animate'},
                {'args': [[None], {'frame': {'duration': 0, 'redraw': False},
                                   'mode': 'immediate',
                                   'transition': {'duration': 0}}],
                 'label': 'Pause', 'method': 'animate'}
            ],
            'direction': 'left', 'pad': {'r': 10, 't': 87}, 'showactive': False,
            'type': 'buttons', 'x': 0.1, 'xanchor': 'right', 'y': 0, 'yanchor': 'top'
        }]
    },
    'frames': []
}

# Slider dict setup
sliders_dict = {
    'active': 0, 'yanchor': 'top', 'xanchor': 'left',
    'currentvalue': {'font': {'size': 20}, 'prefix': 'Epsilon: ', 'visible': True, 'xanchor': 'right'},
    'transition': {'duration': 300, 'easing': 'cubic-in-out'},
    'pad': {'b': 10, 't': 50}, 'len': 0.9, 'x': 0.1, 'y': 0, 'steps': []
}

# Initial data for first frame
for continent in continents:
    data_by_cont = df_plot[df_plot['continent'] == continent]
    figure['data'].append({
        'x': list(data_by_cont['lifeExp']),
        'y': list(data_by_cont['gdpPercap']),
        'mode': 'markers',
        'text': list(data_by_cont['country']),
        'marker': {'sizemode': 'area', 'sizeref': 200_000, 'size': list(data_by_cont['pop'])},
        'name': continent
    })

# Create frames and slider steps
for epsilon in epsilon_values:
    frame = {'data': [], 'name': str(epsilon)}
    for continent in continents:
        data_by_cont = df_plot[(df_plot['continent'] == continent) & (np.round(df_plot['year'], 1) == np.round(epsilon, 1))]
        frame['data'].append({
            'x': list(data_by_cont['lifeExp']),
            'y': list(data_by_cont['gdpPercap']),
            'mode': 'markers',
            'text': list(data_by_cont['country']),
            'marker': {'sizemode': 'area', 'sizeref': 200_000, 'size': list(data_by_cont['pop']), 'color': 'rgba(255,182,193,0.9)'},
            'name': continent
        })
    figure['frames'].append(frame)
    sliders_dict['steps'].append({
        'args': [[epsilon], {'frame': {'duration': 300, 'redraw': False}, 'mode': 'immediate', 'transition': {'duration': 300}}],
        'label': f'{epsilon:.1f}',  # rounded so float noise does not show on the slider
        'method': 'animate'
    })

figure['layout']['sliders'] = [sliders_dict]

iplot(figure)
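
The three animation cells above differ only in the swept parameter and its label. One possible way to avoid the repetition is a small helper that builds the frames and slider steps from plain lists. The function `build_sweep_figure` below is a hypothetical sketch, not part of Plotly; it only assembles the same kind of dictionary that is passed to `iplot` above, with simplified marker and layout settings.

```python
def build_sweep_figure(param_name, param_values, episodes, sums_of_v, decimals=2):
    """Build a Plotly-style figure dict with one frame per swept parameter value.

    param_values, episodes and sums_of_v are equal-length lists (one entry per point).
    """
    rounded = [round(p, decimals) for p in param_values]
    unique_values = sorted(set(rounded), reverse=True)

    frames, steps = [], []
    for value in unique_values:
        name = f"{value:.{decimals}f}"
        xs = [e for e, p in zip(episodes, rounded) if p == value]
        ys = [v for v, p in zip(sums_of_v, rounded) if p == value]
        frames.append({"name": name, "data": [{"x": xs, "y": ys, "mode": "markers"}]})
        steps.append({
            "args": [[name], {"frame": {"duration": 300, "redraw": False},
                              "mode": "immediate",
                              "transition": {"duration": 300}}],
            "label": name,
            "method": "animate",
        })

    return {
        "data": list(frames[0]["data"]) if frames else [],
        "layout": {
            "title": f"Sum of V per episode for each {param_name}",
            "xaxis": {"title": "Episode"},
            "yaxis": {"title": "Sum of V"},
            "sliders": [{"currentvalue": {"prefix": f"{param_name}: "}, "steps": steps}],
        },
        "frames": frames,
    }

# Tiny made-up sweep: two alpha values, three episodes each.
fig = build_sweep_figure(
    "Alpha",
    param_values=[0.1, 0.1, 0.1, 0.05, 0.05, 0.05],
    episodes=[0, 1, 2, 0, 1, 2],
    sums_of_v=[0.0, 0.2, 0.3, 0.0, 0.1, 0.15],
)
print([f["name"] for f in fig["frames"]])
```

The returned dictionary has the same top-level keys (`data`, `layout`, `frames`) as the figures built by hand above, so it could be displayed the same way.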


In [None]:
# Set budget and hyperparameters
budget8 = 23
alpha8 = 0.05
num_episodes8 = 1000
epsilon8 = 0.2

# Rewards for actions (currently all zeros)
reward8 = [0, 0, 0, 0, 0, 0, 0, 0, 0]

start_time = time.time()

Mdl8 = MCModelv1(
    data=data,
    alpha=alpha8,
    e=num_episodes8,
    epsilon=epsilon8,
    budget=budget8,
    reward=reward8
)

print(f"--- {time.time() - start_time:.2f} seconds ---")

# Plot total value per episode
plt.plot(range(num_episodes8), Mdl8[0], label="Sum of V (all actions)")
plt.title("Sum of V for all Actions at each Episode")
plt.xlabel("Episode")
plt.ylabel("Sum of V")
plt.legend()
plt.show()

# Plot cheapest vs other actions
plt.plot(range(num_episodes8), Mdl8[1], label="Cheapest actions")
plt.plot(range(num_episodes8), Mdl8[2], label="Other actions")
plt.title("Sum of V for Cheapest vs Other Actions per Episode")
plt.xlabel("Episode")
plt.ylabel("Sum of V")
plt.legend()
plt.show()


In [None]:
def MCModelv2(data, alpha, e, epsilon, budget, reward):
    Ingredients = list(set(data['Ingredient']))
    data = data.copy()  # avoid modifying the caller's DataFrame
    data['V'] = data['V_0'].copy()
    
    output = []
    output1 = []
    output2 = []
    actioninfull = []

    for episode_idx in range(e):
        episode_run = []

        for i in range(len(Ingredients)):
            if episode_idx == 0:
                # Initialize randomly for first episode
                episode_run.append(
                    np.random.randint(
                        1, sum(1 for p in data.iloc[:, 0] if p == i + 1) + 1
                    )
                )
            else:
                if np.random.rand() < epsilon:
                    # Explore: random action
                    episode_run.append(
                        np.random.randint(
                            1, sum(1 for p in data.iloc[:, 0] if p == i + 1) + 1
                        )
                    )
                else:
                    # Exploit: choose action with max V
                    data_I = data[data['Ingredient'] == (i + 1)]
                    max_V_rows = data_I[data_I['V'] == data_I['V'].max()]
                    chosen_product = max_V_rows['Product'].values[0]
                    episode_run.append(chosen_product)

        episode_run = np.array(episode_run, dtype=int)

        episode = pd.DataFrame({'Ingredient': Ingredients, 'Product': episode_run})
        episode['Merged_label'] = (episode['Ingredient'] * 10 + episode['Product']).astype(float)
        data['QMerged_label'] = data['QMerged_label'].astype(float)
        data['Reward'] = reward

        episode2 = episode.merge(
            data[['QMerged_label', 'Real_Cost', 'Reward']],
            left_on='Merged_label', right_on='QMerged_label',
            how='inner'
        )
        data = data.drop(columns=['Reward'])

        # Terminal reward calculation
        Return = budget - episode2['Real_Cost'].sum()
        episode2 = episode2.drop(columns=['Reward'])
        episode2['Return'] = Return

        # Update V values for actions involved in this episode
        data = data.merge(
            episode2[['Merged_label', 'Return']],
            left_on='QMerged_label', right_on='Merged_label',
            how='left'  # keeps the original row order
        )
        # A NaN Return marks a product that was not chosen; only chosen products are updated
        chosen = data['Return'].notna()
        data['V'] = np.where(
            chosen,
            data['V'] + alpha * (data['Return'] / len(Ingredients) - data['V']),
            data['V']
        )

        data = data.drop(columns=['Merged_label', 'Return'])

        # Track outputs
        output.append(data['V'].sum())
        output1.append(data.iloc[[1, 2, 4, 8], -1].sum())  # cheapest
        output2.append(data.iloc[[0, 3, 5, 6, 7], -1].sum())  # expensive

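        # Determine optimal actions: best-V product per ingredient, ties broken at random.
        # Merging on both keys avoids duplicate 'Ingredient' columns and cross-ingredient matches.
        action_df = data.groupby('Ingredient')['V'].max().reset_index()
        action_merge = action_df.merge(data, on=['Ingredient', 'V'], how='inner')
        action3 = (
            action_merge.groupby('Ingredient')['Product']
            .apply(lambda x: x.iloc[np.random.randint(0, len(x))])
        )
        actioninfull.extend(action3.astype(int).tolist())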

    return np.array(output), np.array(output1), np.array(output2), action3, data, np.array(actioninfull, dtype=int)
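
In this version the return is the money left over, `budget - total cost`, rather than a fixed +1 or -1, so cheaper selections earn larger returns and overspending produces a negative one. A short worked example with invented costs follows; the helper name `margin_return` is illustrative.

```python
def margin_return(costs, budget):
    """Return the unspent budget; negative when the selection overspends."""
    return budget - sum(costs)

example_budget = 23
n_ingredients = 4

cheap = margin_return([6, 5, 4, 5], example_budget)     # 23 - 20 = 3
pricey = margin_return([10, 8, 7, 9], example_budget)   # 23 - 34 = -11

# Per-ingredient targets that the chosen products' V values are pulled toward.
print(cheap / n_ingredients, pricey / n_ingredients)
```

Because the size of the return now scales with the prices, the `V` values are no longer bounded by 1 in magnitude, which is worth keeping in mind when comparing plots across model versions.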


In [None]:
# Set budget and hyperparameters
budget9 = 23
alpha9 = 0.05
num_episodes9 = 1000
epsilon9 = 0.2

# Simple rewards (currently all zeros)
reward9 = [0, 0, 0, 0, 0, 0, 0, 0, 0]

start_time = time.time()

Mdl9 = MCModelv2(
    data=data,
    alpha=alpha9,
    e=num_episodes9,
    epsilon=epsilon9,
    budget=budget9,
    reward=reward9
)

print(f"--- {time.time() - start_time:.2f} seconds ---")
print(Mdl9[3])

# Plot total value per episode
plt.plot(range(num_episodes9), Mdl9[0], label="Sum of V (all actions)")
plt.title("Sum of V for all Actions at each Episode")
plt.xlabel("Episode")
plt.ylabel("Sum of V")
plt.legend()
plt.show()

# Plot cheapest vs other actions
plt.plot(range(num_episodes9), Mdl9[1], label="Cheapest actions")
plt.plot(range(num_episodes9), Mdl9[2], label="Other actions")
plt.title("Sum of V for Cheapest vs Other Actions per Episode")
plt.xlabel("Episode")
plt.ylabel("Sum of V")
plt.legend()
plt.show()


In [None]:
def MCModelv3(data, alpha, e, epsilon, budget, reward):
    Ingredients = list(set(data['Ingredient']))
    data = data.copy()  # avoid modifying the caller's DataFrame
    data['V'] = data['V_0'].copy()
    
    output = []
    output1 = []
    output2 = []
    actioninfull = []

    for episode_idx in range(e):
        episode_run = []

        for i in range(len(Ingredients)):
            if episode_idx == 0:
                # Initialize randomly for first episode
                episode_run.append(
                    np.random.randint(
                        1, sum(1 for p in data.iloc[:, 0] if p == i + 1) + 1
                    )
                )
            else:
                if np.random.rand() < epsilon:
                    # Explore: random action
                    episode_run.append(
                        np.random.randint(
                            1, sum(1 for p in data.iloc[:, 0] if p == i + 1) + 1
                        )
                    )
                else:
                    # Exploit: choose action with max V
                    data_I = data[data['Ingredient'] == (i + 1)]
                    max_V_rows = data_I[data_I['V'] == data_I['V'].max()]
                    chosen_product = max_V_rows['Product'].values[0]
                    episode_run.append(chosen_product)

        episode_run = np.array(episode_run, dtype=int)

        episode = pd.DataFrame({'Ingredient': Ingredients, 'Product': episode_run})
        episode['Merged_label'] = (episode['Ingredient'] * 10 + episode['Product']).astype(float)
        data['QMerged_label'] = data['QMerged_label'].astype(float)
        data['Reward'] = reward

        episode2 = episode.merge(
            data[['QMerged_label', 'Real_Cost', 'Reward']],
            left_on='Merged_label', right_on='QMerged_label',
            how='inner'
        )
        data = data.drop(columns=['Reward'])

        # Terminal reward calculation
        if budget >= episode2['Real_Cost'].sum():
            Return = 1 + episode2['Reward'].sum() / len(Ingredients)
        else:
            Return = -1 + episode2['Reward'].sum() / len(Ingredients)

        episode2 = episode2.drop(columns=['Reward'])
        episode2['Return'] = Return

        # Update V values for actions involved in this episode
        data = data.merge(
            episode2[['Merged_label', 'Return']],
            left_on='QMerged_label', right_on='Merged_label',
            how='left'  # keeps the original row order
        )
        # A NaN Return marks a product that was not chosen; only chosen products are updated
        chosen = data['Return'].notna()
        data['V'] = np.where(
            chosen,
            data['V'] + alpha * (data['Return'] / len(Ingredients) - data['V']),
            data['V']
        )

        data = data.drop(columns=['Merged_label', 'Return'])

        # Track outputs
        output.append(data['V'].sum())
        output1.append(data.iloc[[1, 2, 4, 8], -1].sum())  # cheapest
        output2.append(data.iloc[[0, 3, 5, 6, 7], -1].sum())  # expensive

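        # Determine optimal actions: best-V product per ingredient, ties broken at random.
        # Merging on both keys avoids duplicate 'Ingredient' columns and cross-ingredient matches.
        action_df = data.groupby('Ingredient')['V'].max().reset_index()
        action_merge = action_df.merge(data, on=['Ingredient', 'V'], how='inner')
        action3 = (
            action_merge.groupby('Ingredient')['Product']
            .apply(lambda x: x.iloc[np.random.randint(0, len(x))])
        )
        actioninfull.extend(action3.astype(int).tolist())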

    return np.array(output), np.array(output1), np.array(output2), action3, data, np.array(actioninfull, dtype=int)
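
Here the budget outcome (+1 or -1) is combined with the average preference reward of the chosen products. The sketch below reproduces that arithmetic in isolation; the function name `preference_return`, the costs and the reward values are illustrative.

```python
def preference_return(costs, rewards, budget):
    """Budget outcome (+1 / -1) plus the mean preference reward of the chosen products."""
    base = 1 if sum(costs) <= budget else -1
    return base + sum(rewards) / len(rewards)

# Four chosen products; in the first two cases, two of them carry a reward of 0.5.
liked_in_budget = preference_return([10, 8, 4, 5], [0.5, 0.5, 0, 0], budget=30)
liked_over_budget = preference_return([10, 8, 7, 9], [0.5, 0.5, 0, 0], budget=30)
plain_in_budget = preference_return([6, 5, 4, 5], [0, 0, 0, 0], budget=30)

print(liked_in_budget, plain_in_budget, liked_over_budget)
```

As long as every reward lies between 0 and 1, a selection within budget always scores higher than one over budget, so preferences only rank the selections that fit the budget.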


In [None]:
budget10 = 30

alpha10 = 0.05
num_episodes10 = 1000
epsilon10 = 0.2

# Simple rewards for selected actions
reward10 = [0.8, 0, 0, 0.8, 0, 0, 0, 0, 0]

start_time = time.time()

Mdl10 = MCModelv3(
    data=data,
    alpha=alpha10,
    e=num_episodes10,
    epsilon=epsilon10,
    budget=budget10,
    reward=reward10
)

print(f"--- {time.time() - start_time:.2f} seconds ---")
print(Mdl10[3])

# Plot total value per episode
plt.plot(range(num_episodes10), Mdl10[0], label="Sum of V (all actions)")
plt.title("Sum of V for all Actions at each Episode")
plt.xlabel("Episode")
plt.ylabel("Sum of V")
plt.legend()
plt.show()

# Plot cheapest vs other actions
plt.plot(range(num_episodes10), Mdl10[1], label="Cheapest actions")
plt.plot(range(num_episodes10), Mdl10[2], label="Other actions")
plt.title("Sum of V for Cheapest vs Other Actions per Episode")
plt.xlabel("Episode")
plt.ylabel("Sum of V")
plt.legend()
plt.show()


In [None]:
budget11 = 30

alpha11 = 0.05
num_episodes11 = 1000
epsilon11 = 0.2

# Rewards for actions (can be tuned further)
reward11 = [0.8, 0.4, 0.5, 0.6, 0.4, 0.4, 0.6, 0.2, 0.4]

start_time = time.time()

Mdl11 = MCModelv3(
    data=data,
    alpha=alpha11,
    e=num_episodes11,
    epsilon=epsilon11,
    budget=budget11,
    reward=reward11
)

print(f"--- {time.time() - start_time:.2f} seconds ---")
print(Mdl11[3])

# Plot total value per episode
plt.plot(range(num_episodes11), Mdl11[0], label="Sum of V (all actions)")
plt.title("Sum of V for all Actions at each Episode")
plt.xlabel("Episode")
plt.ylabel("Sum of V")
plt.legend()
plt.show()

# Plot cheapest vs other actions
plt.plot(range(num_episodes11), Mdl11[1], label="Cheapest actions")
plt.plot(range(num_episodes11), Mdl11[2], label="Other actions")
plt.title("Sum of V for Cheapest vs Other Actions per Episode")
plt.xlabel("Episode")
plt.ylabel("Sum of V")
plt.legend()
plt.show()


## Conclusion

This work demonstrates how a Monte Carlo reinforcement learning model can be used to recommend products in different ways: keeping selections under a budget, choosing the cheapest available option, or identifying the best option based on preference while still respecting cost limits. The process highlights how tuning parameters in RL directly shapes outcomes, making it possible to guide the model toward specific goals. While this experiment was carried out on simplified sample data, the approach holds potential for more practical applications, such as recommending ingredients and products from a supermarket, where the variety is much larger and the decision-making more complex. Overall, the project serves as a step toward exploring RL in real-world recommendation scenarios.
