## This notebook combines hill climbing with Gym environments (to address known issues from the lab midsem exams)

In [16]:
import numpy as np
'''
To use hill climbing, you first define a policy parameterized by a set of weights. A simple choice is a linear policy that computes a score for each action from the state. For example, if you use a weight matrix W of shape (3, 2) (three actions and two state features), you can compute:
'''
def policy(weights,state):
    # state has shape (2,): the car's position and velocity
    # weights has shape (action_space.n, state_dim), here (3,2) because there are 3 possible actions
    
    # Linear policy: score every action, then take the action with the highest score
    scores = np.dot(weights,state)
    return np.argmax(scores)
'''
This simple linear policy maps the state to an action by taking the dot product with each row of the weight matrix and then selecting the action with the highest resulting value.
'''

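As a small worked example of the linear policy, the sketch below uses hand-picked weights and a hand-picked state (both are illustrative assumptions, not values from a trained policy). In `MountainCar-v0` the three actions are 0 (push left), 1 (no push), and 2 (push right).

```python
import numpy as np

def policy(weights, state):
    # One score per action (one row of weights each); pick the highest
    return int(np.argmax(np.dot(weights, state)))

# Illustrative weights: one row per action, one column per state feature
weights = np.array([[ 1.0,  0.0],   # action 0: push left
                    [ 0.0,  0.0],   # action 1: no push
                    [-1.0, 10.0]])  # action 2: push right
state = np.array([-0.5, 0.02])      # position, velocity

scores = np.dot(weights, state)     # [-0.5, 0.0, 0.7]
action = policy(weights, state)     # 2, because 0.7 is the largest score
print(scores, action)
```

Because only the ordering of the scores matters, scaling the whole weight matrix by a positive constant leaves the chosen action unchanged.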

In [17]:
'''
You need a way to judge the quality of a particular policy (set of weights). The evaluation function runs one or several episodes in the environment using the current policy and returns an average performance score (for Mountain Car, a higher score means fewer steps to reach the goal, since you receive -1 per time step):
'''

import gymnasium as gym
def evaluate(weights,env,episodes =5):
    #this function is to evaluate the performance of the policy by running on episodes
    total_reward = 0
    for _ in range(episodes):
        state,info = env.reset()
        done =False
        truncated = False
        while not done and not truncated:
            action = policy(weights,state)#choose action
            state,reward,done,truncated,info = env.step(action)
            total_reward += reward
    return total_reward/episodes  # average cumulative reward per episode, used as the policy's score
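To see what `evaluate` returns without installing an environment, here is a minimal sketch that runs the same loop against a hypothetical stand-in, `CountdownEnv`. This class is not part of Gymnasium; it only mimics the `reset`/`step` return shapes and gives -1 per step until a step limit truncates the episode.

```python
import numpy as np

class CountdownEnv:
    """Hypothetical stand-in env: reward -1 per step, truncated after `limit` steps."""
    def __init__(self, limit=5):
        self.limit = limit
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(2), {}          # (observation, info)

    def step(self, action):
        self.t += 1
        truncated = self.t >= self.limit
        # (observation, reward, terminated, truncated, info)
        return np.zeros(2), -1.0, False, truncated, {}

def policy(weights, state):
    return int(np.argmax(np.dot(weights, state)))

def evaluate(weights, env, episodes=5):
    total_reward = 0.0
    for _ in range(episodes):
        state, info = env.reset()
        done = truncated = False
        while not done and not truncated:
            state, reward, done, truncated, info = env.step(policy(weights, state))
            total_reward += reward
    return total_reward / episodes

score = evaluate(np.zeros((3, 2)), CountdownEnv(limit=5))
print(score)  # -5.0: every episode runs to the step limit
```

In this stand-in the goal is never reached, so every set of weights gets the same score. This mirrors what happens in Mountain Car when a policy never reaches the flag before the time limit.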

In [18]:
np.random.seed(42)
import random
random.seed(42)

In [19]:
## simple hill climbing

'''
In simple hill climbing, you start with a random set of weights. Then, at each iteration, you generate a single “neighbor” (by adding a small random perturbation to the weights) and update your weights only if the new weights result in a better score.
'''
def simple_hill_climbing(env,iterations = 50,noise_scale = 0.1):
    #init weights 
    best_weights = np.random.randn(3,2)
    best_score = evaluate(best_weights,env)
    
    for i in range(iterations):
        #find one neighbor
        new_weights = best_weights + np.random.randn(3,2) * noise_scale
        new_score = evaluate(new_weights,env)
        if new_score > best_score:
            best_score,best_weights = new_score,new_weights
            print(f"Iteration {i}: Improved score to {best_score}")
    
    return best_weights

In [20]:
'''
In steepest ascent hill climbing, you generate several candidate neighbors at each iteration, evaluate all of them, and choose the best candidate (if it improves over the current best).
'''
def steepest_hill_climbing(env,iterations = 50,noise_scale = 0.1,num_neighbours = 10):
    best_weights = np.random.randn(3,2)
    best_score = evaluate(best_weights,env)
    
    for i in range(iterations):
        # Perturb with zero-mean Gaussian noise (randn, not rand) so neighbors lie in every direction
        neighbors = [best_weights + np.random.randn(3,2)*noise_scale  for _ in range(num_neighbours)]
        scores = [evaluate(candidate,env) for candidate in neighbors]
        best_index = np.argmax(scores)  # index of the best-scoring neighbor
        if scores[best_index] > best_score:
            best_score,best_weights = scores[best_index],neighbors[best_index]
            print(f"Iteration {i}: Improved score to {best_score}")
    return best_weights

In [21]:
'''
Stochastic hill climbing also generates multiple neighbors, but rather than always choosing the best one, it selects one at random with probability proportional to its shifted score. This can help explore the search space more broadly.
'''
def stochastic_hill_climbing(env, iterations=50, noise_scale=0.1, num_neighbors=10):
    best_weights = np.random.randn(3, 2)
    best_score = evaluate(best_weights, env)
    
    for i in range(iterations):
        neighbors = [best_weights + np.random.randn(3, 2) * noise_scale for _ in range(num_neighbors)]
        scores = [evaluate(candidate, env) for candidate in neighbors]
        
        # Adjust scores to be positive (since rewards are negative in Mountain Car)
        min_score = min(scores)
        adjusted_scores = [score - min_score + 1e-6 for score in scores]
        probabilities = np.array(adjusted_scores) / np.sum(adjusted_scores)
        
        # Choose one neighbor based on the probability distribution
        chosen_index = np.random.choice(range(num_neighbors), p=probabilities)
        if scores[chosen_index] > best_score:
            best_weights, best_score = neighbors[chosen_index], scores[chosen_index]
            print(f"Iteration {i}: Improved score to {best_score}")
    return best_weights
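The score-to-probability step in `stochastic_hill_climbing` can be checked by hand. The sketch below applies the same shift-and-normalize arithmetic to three made-up neighbor scores (illustrative values, not results from a run).

```python
import numpy as np

scores = [-200.0, -180.0, -150.0]   # made-up neighbor scores

# Shift so the worst score maps to a tiny positive value, then normalize
min_score = min(scores)
adjusted = np.array([s - min_score + 1e-6 for s in scores])
probabilities = adjusted / adjusted.sum()
print(probabilities)  # roughly [0.0, 0.286, 0.714]

# If all neighbors tie, every adjusted score is 1e-6 and the choice is uniform
tied = np.array([-200.0, -200.0, -200.0])
tied_adjusted = tied - tied.min() + 1e-6
tied_probabilities = tied_adjusted / tied_adjusted.sum()
print(tied_probabilities)  # [1/3, 1/3, 1/3]
```

Two consequences of this design: the worst neighbor is almost never selected, and the probabilities depend on score differences rather than on the absolute scores.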


In [28]:
env = gym.make('MountainCar-v0')

print("Running Simple Hill Climbing")
best_weights_simple = simple_hill_climbing(env)

print("\nRunning Steepest Ascent Hill Climbing")
best_weights_steepest = steepest_hill_climbing(env)

print("\nRunning Stochastic Hill Climbing")
best_weights_stochastic = stochastic_hill_climbing(env)

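# Optionally, test the best weights by running one more episode.
# Note: env.render() only displays something if the environment was created with a
# render mode, e.g. gym.make('MountainCar-v0', render_mode='human'); otherwise it just warns.
state,info = env.reset()
done = False
truncated = False
steps = 0
while not done and not truncated:
    env.render()
    action = policy(best_weights_simple, state)  # or best_weights_steepest / best_weights_stochastic
    state, reward, done, truncated,info = env.step(action)
    steps += 1  # count time steps; the episode return is -steps because each step gives -1
env.close()

print(steps)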

Running Simple Hill Climbing

Running Steepest Ascent Hill Climbing

Running Stochastic Hill Climbing
200




In this approach, hill climbing directly optimizes the weights of the linear policy. The algorithm perturbs the weights and keeps a change only when the policy scores better in the environment (here, the average cumulative reward over five episodes). Repeating this perturb-and-evaluate loop gradually improves the policy so that it chooses actions leading to better overall performance.

To summarize:

- **Policy:** A linear mapping defined by weights.
- **Objective:** Optimize the weights so that the policy's actions yield higher cumulative rewards.
- **Method:** Use hill climbing (simple, steepest, or stochastic) to explore weight adjustments and select the best performing ones.

This is a common strategy for environments where you can parameterize the decision-making process with a set of weights.
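The policy/objective/method split above does not depend on the environment. As a minimal sketch, the same accept-if-better loop is run below on a toy objective whose maximum is known. The names `hill_climb` and `toy_objective` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def hill_climb(objective, start, iterations=200, noise_scale=0.1, seed=0):
    """Simple hill climbing: keep a random perturbation only if it scores higher."""
    rng = np.random.default_rng(seed)
    best = np.asarray(start, dtype=float)
    best_score = objective(best)
    for _ in range(iterations):
        candidate = best + rng.normal(size=best.shape) * noise_scale
        score = objective(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

def toy_objective(w):
    # Maximum value 0 is reached when every weight equals 1
    return -float(np.sum((w - 1.0) ** 2))

start = np.zeros((3, 2))
best, best_score = hill_climb(toy_objective, start)
print(toy_objective(start), best_score)  # the final score is never below the starting score
```

Unlike an episode return, this objective is smooth and noise-free, so almost any step in the right direction registers as an improvement. Environment returns are noisier and often flatter, which is why progress there can be much slower.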

In principle, the same hill-climbing approach applies to other Gym environments such as CartPole or Taxi. The overall approach remains the same:

- **Parameterized Policy:**  
  You'll define a policy with parameters (e.g., a weight matrix for a linear policy) that maps states to actions. For instance, for CartPole, where the state has four dimensions, your weight matrix might have shape `(n_actions, 4)`.

- **Evaluation:**  
  You still evaluate the policy by running full episodes and calculating the cumulative reward. The reward structure differs between environments (CartPole gives +1 per time step until failure; Taxi gives -1 per step, +20 for a successful drop-off, and -10 for an illegal pickup or drop-off), but the idea is always to maximize the total reward.

- **Hill Climbing Process:**  
  You generate candidate solutions (neighbors) by slightly perturbing the current parameters and update them if they improve the cumulative reward.

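Keep in mind that the shape of the reward signal determines how much feedback hill climbing gets. In CartPole, every extra time step of survival adds +1, so even a slightly better set of weights produces a higher score. In Mountain Car, every policy that fails to reach the goal within the 200-step limit scores exactly -200, so most neighbors tie with the current weights and the strict `>` comparison never accepts them. This is why the run above prints no "Improved score" lines and the test episode lasts the full 200 steps.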

Also, for environments like Taxi, where the state is typically represented as a discrete integer, you might need a different policy representation or feature encoding (e.g., one-hot encoding or a small neural network) to make hill climbing effective.
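A minimal sketch of the one-hot option for Taxi follows. It assumes the `Taxi-v3` sizes of 500 discrete states and 6 actions. The helper `one_hot` and the random weights are illustrative, not part of the notebook.

```python
import numpy as np

N_STATES = 500   # Taxi-v3 has 500 discrete states
N_ACTIONS = 6    # and 6 discrete actions

def one_hot(state_index, n_states=N_STATES):
    # Turn the integer state into a feature vector with a single 1
    features = np.zeros(n_states)
    features[state_index] = 1.0
    return features

def policy(weights, state_index):
    # Same linear policy as before, applied to the one-hot features
    return int(np.argmax(weights @ one_hot(state_index)))

rng = np.random.default_rng(0)
weights = rng.normal(size=(N_ACTIONS, N_STATES))

action = policy(weights, 42)
# With one-hot features the product just selects column 42 of the weights
print(action == int(np.argmax(weights[:, 42])))  # True
```

With this encoding the linear policy is effectively a lookup table with 6 x 500 = 3000 parameters, so random perturbation has a much larger space to search than the 6 weights used for Mountain Car.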

In short, hill climbing carries over to other Gym environments, but you may need to adjust the policy representation and evaluation strategy to each environment's characteristics.