# Task
Create an OpenAI Gym-compatible stock market environment in Python, suitable for training a PyTorch-based reinforcement learning agent, starting with a simple CSV file for historical data.

## Define Data Source and Features

### Subtask:
Establish how to obtain historical or simulated stock market data. Identify and preprocess relevant features (e.g., open, high, low, close prices, volume, technical indicators) that will form the observation space for the RL agent. We can start with a simple CSV file for historical data.


**Reasoning**:
The first step is to create a sample CSV file with historical stock data, as specified in the instructions. This will provide the necessary data for subsequent steps like loading and preprocessing.



In [1]:
import pandas as pd
import numpy as np

# Create a sample DataFrame with historical stock data
dates = pd.date_range(start='2023-01-01', periods=10, freq='D')  # already a DatetimeIndex

data = {
    'Date': dates,
    'Open': np.random.uniform(100, 110, 10),
    'High': np.random.uniform(110, 115, 10),
    'Low': np.random.uniform(95, 100, 10),
    'Close': np.random.uniform(100, 110, 10),
    'Volume': np.random.randint(100000, 500000, 10)
}

df_sample = pd.DataFrame(data)

# Introduce some missing values for demonstration of handling missing values
df_sample.loc[[2, 5], 'Open'] = np.nan
df_sample.loc[7, 'Close'] = np.nan

# Save the DataFrame to a CSV file
df_sample.to_csv('stock_data.csv', index=False)

print("Sample 'stock_data.csv' created successfully with dummy data and some missing values.")
print(df_sample.head())

Sample 'stock_data.csv' created successfully with dummy data and some missing values.
        Date        Open        High        Low       Close  Volume
0 2023-01-01  104.455039  113.829450  95.142036  100.375870  367613
1 2023-01-02  103.588101  110.843778  97.188188  108.337415  291037
2 2023-01-03         NaN  114.294036  99.420789  106.032623  470461
3 2023-01-04  109.935511  113.050723  97.838651  108.797715  196769
4 2023-01-05  105.404743  110.112004  96.982422  100.727044  468922


**Reasoning**:
The next step is to load the 'stock_data.csv' file into a Pandas DataFrame, parse the 'Date' column, and set it as the DataFrame index, as specified in the instructions.



In [2]:
df = pd.read_csv('stock_data.csv', parse_dates=['Date'], index_col='Date')

print("DataFrame loaded successfully with 'Date' as index:")
print(df.head())

DataFrame loaded successfully with 'Date' as index:
                  Open        High        Low       Close  Volume
Date                                                             
2023-01-01  104.455039  113.829450  95.142036  100.375870  367613
2023-01-02  103.588101  110.843778  97.188188  108.337415  291037
2023-01-03         NaN  114.294036  99.420789  106.032623  470461
2023-01-04  109.935511  113.050723  97.838651  108.797715  196769
2023-01-05  105.404743  110.112004  96.982422  100.727044  468922


**Reasoning**:
Now that the data is loaded and indexed, I will select the relevant feature columns and handle any missing values within them using forward fill, which is a suitable method for time-series data.



In [3]:
feature_columns = ['Open', 'High', 'Low', 'Close', 'Volume']
df_features = df[feature_columns].copy()

# Handle missing values using forward fill
df_features.fillna(method='ffill', inplace=True)

# If there are still NaN values at the beginning (where ffill can't fill), use backward fill
df_features.fillna(method='bfill', inplace=True)

print("Selected features and handled missing values:")
print(df_features.head())
print("Missing values after handling:\n", df_features.isnull().sum())

Selected features and handled missing values:
                  Open        High        Low       Close  Volume
Date                                                             
2023-01-01  104.455039  113.829450  95.142036  100.375870  367613
2023-01-02  103.588101  110.843778  97.188188  108.337415  291037
2023-01-03  103.588101  114.294036  99.420789  106.032623  470461
2023-01-04  109.935511  113.050723  97.838651  108.797715  196769
2023-01-05  105.404743  110.112004  96.982422  100.727044  468922
Missing values after handling:
 Open      0
High      0
Low       0
Close     0
Volume    0
dtype: int64


  df_features.fillna(method='ffill', inplace=True)
  df_features.fillna(method='bfill', inplace=True)


**Reasoning**:
The previous code produced a FutureWarning regarding the deprecated use of `fillna(method=...)`. I will update the code to use the recommended `ffill()` and `bfill()` methods to address this warning.



In [4]:
feature_columns = ['Open', 'High', 'Low', 'Close', 'Volume']
df_features = df[feature_columns].copy()

# Handle missing values using forward fill
df_features.ffill(inplace=True)

# If there are still NaN values at the beginning (where ffill can't fill), use backward fill
df_features.bfill(inplace=True)

print("Selected features and handled missing values:")
print(df_features.head())
print("Missing values after handling:\n", df_features.isnull().sum())

Selected features and handled missing values:
                  Open        High        Low       Close  Volume
Date                                                             
2023-01-01  104.455039  113.829450  95.142036  100.375870  367613
2023-01-02  103.588101  110.843778  97.188188  108.337415  291037
2023-01-03  103.588101  114.294036  99.420789  106.032623  470461
2023-01-04  109.935511  113.050723  97.838651  108.797715  196769
2023-01-05  105.404743  110.112004  96.982422  100.727044  468922
Missing values after handling:
 Open      0
High      0
Low       0
Close     0
Volume    0
dtype: int64


**Reasoning**:
Now that missing values have been handled, the next step is to normalize the selected features to a common scale as specified in the instructions. This will prevent features with larger magnitudes from dominating the learning process for the reinforcement learning agent.



In [5]:
from sklearn.preprocessing import MinMaxScaler

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Apply scaling to the feature DataFrame
df_scaled = pd.DataFrame(scaler.fit_transform(df_features), columns=df_features.columns, index=df_features.index)

print("Features normalized successfully:")
print(df_scaled.head())

# Verify the range of the scaled data (should be between 0 and 1)
print("\nMin values after scaling:\n", df_scaled.min())
print("\nMax values after scaling:\n", df_scaled.max())

Features normalized successfully:
                Open      High       Low     Close    Volume
Date                                                        
2023-01-01  0.415340  0.888909  0.000000  0.000000  0.637463
2023-01-02  0.322855  0.174981  0.478212  0.945345  0.403393
2023-01-03  0.322855  1.000000  1.000000  0.671676  0.951839
2023-01-04  1.000000  0.702701  0.630234  1.000000  0.115244
2023-01-05  0.516655  0.000000  0.430122  0.041698  0.947134

Min values after scaling:
 Open      0.0
High      0.0
Low       0.0
Close     0.0
Volume    0.0
dtype: float64

Max values after scaling:
 Open      1.0
High      1.0
Low       1.0
Close     1.0
Volume    1.0
dtype: float64
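
One caveat about the scaling step above: fitting the scaler on the full history lets statistics from later dates leak into earlier observations. A minimal sketch of one way to avoid this, assuming a simple chronological train/test split and plain NumPy instead of `MinMaxScaler` (the helper names `fit_min_max` and `apply_min_max`, the toy data, and the split point are all hypothetical):

```python
import numpy as np

def fit_min_max(train):
    """Return per-column (min, range) computed from the training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    return col_min, col_range

def apply_min_max(values, col_min, col_range):
    """Scale with training statistics and clip so unseen extremes stay in [0, 1]."""
    return np.clip((values - col_min) / col_range, 0.0, 1.0)

# Hypothetical toy data: 4 rows x 2 features (for example Close and Volume)
values = np.array([
    [100.0, 1000.0],
    [110.0, 3000.0],
    [105.0, 2000.0],
    [120.0,  500.0],   # "future" row, outside the training range
])
split = 3  # assumption: the first 3 rows are training data
col_min, col_range = fit_min_max(values[:split])
scaled = apply_min_max(values, col_min, col_range)
print(scaled)
```

Clipping keeps the observations inside the `[0, 1]` bounds, at the price of flattening values that exceed the training range.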


## Design OpenAI Gym Interface

### Subtask:
Define the observation space, the action space, and a conceptual reward function for the OpenAI Gym-compatible stock market environment.


**Reasoning**:
The first step is to import the `gymnasium` library (aliased as `gym`), which is necessary for defining the observation and action spaces using `gym.spaces.Box` and `gym.spaces.Discrete`.



In [6]:
import gymnasium as gym
import numpy as np

print("Gymnasium library imported successfully.")

Gymnasium library imported successfully.


**Reasoning**:
Now that the `gymnasium` library is imported, I will define the observation space using `gym.spaces.Box`. The shape will be determined by the number of features in `df_scaled`, and the bounds will be from 0 to 1 as the data is normalized.



In [7]:
n_features = df_scaled.shape[1]
observation_space = gym.spaces.Box(low=0, high=1, shape=(n_features,), dtype=np.float32)

print("Observation Space defined:")
print(observation_space)
print(f"Shape of observation space: {observation_space.shape}")

Observation Space defined:
Box(0.0, 1.0, (5,), float32)
Shape of observation space: (5,)


**Reasoning**:
With the observation space defined, the next step is to define the action space using `gym.spaces.Discrete` to represent 'hold', 'buy', and 'sell' actions, as specified in the instructions.



In [8]:
action_space = gym.spaces.Discrete(3) # 0: hold, 1: buy, 2: sell

print("Action Space defined:")
print(action_space)
print(f"Number of actions: {action_space.n}")

Action Space defined:
Discrete(3)
Number of actions: 3
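
The integer codes in the action space are easy to mix up in later code. As an optional convenience, they could be given names; the sketch below assumes a hypothetical `Action` enum that is not part of the notebook, and uses only the standard library:

```python
from enum import IntEnum

class Action(IntEnum):
    """Hypothetical named constants for the three discrete actions."""
    HOLD = 0
    BUY = 1
    SELL = 2

# An IntEnum member compares equal to its integer value, so a value sampled
# from a Discrete(3) space can be converted for readable logging.
sampled = 1  # stand-in for a value drawn from the action space
print(Action(sampled).name)
```

Because `IntEnum` members are integers, comparisons such as `action == Action.BUY` work with plain integer actions.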


### Conceptual Reward Function

The reward function is crucial for guiding the reinforcement learning agent towards profitable trading strategies. For this stock market environment, the reward will primarily be based on the change in the agent's portfolio value, incorporating both cash and the value of held assets.

Here's a conceptual breakdown:

1.  **Portfolio Value Change**: The primary reward signal will be the change in the agent's total portfolio value (cash + (number of shares * current close price)) from one step to the next. A positive change indicates profit, leading to a positive reward, while a negative change indicates loss, leading to a negative reward.
2.  **Profit/Loss from Trades**:
    *   **Buying**: If the agent buys shares, the reward will not immediately reflect profit. The potential profit is realized when the shares are sold.
    *   **Selling**: When the agent sells shares, the profit or loss from that specific trade (sale price - purchase price) will contribute to the reward. Selling at a higher price than bought should yield a positive reward, and vice-versa.
3.  **Transaction Costs**: A small penalty can be introduced for each 'buy' or 'sell' action to simulate transaction fees (e.g., commissions). This encourages the agent to make fewer, more impactful trades.
4.  **Holding Costs**: In some scenarios, a small negative reward for holding assets could be included to reflect opportunity costs or financing charges, though for simplicity, this might be omitted initially.
5.  **Risk Adjustment (Optional)**: For more advanced environments, the reward could be adjusted based on the risk taken. For instance, high returns with high volatility might be penalized compared to consistent, moderate returns.

**Goal**: The overall objective of the reward function is to maximize the agent's final portfolio value over the trading period, encouraging intelligent buying and selling decisions while minimizing losses and unnecessary transactions.
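
The core of this breakdown, the change in portfolio value with an optional per-trade penalty, can be sketched as follows. This is an illustration under stated assumptions, not the environment's implementation: the function names `portfolio_value` and `step_reward` and the `trade_penalty` parameter are hypothetical.

```python
def portfolio_value(cash, shares, price):
    """Total value: cash plus the market value of held shares."""
    return cash + shares * price

def step_reward(prev_value, cash, shares, price, traded, trade_penalty=0.0):
    """Reward = change in portfolio value, minus an optional flat penalty per trade.

    Returns the reward and the new portfolio value, which the caller
    passes back in as prev_value on the next step.
    """
    value = portfolio_value(cash, shares, price)
    reward = value - prev_value
    if traded:
        reward -= trade_penalty
    return reward, value

# Holding 10 shares while the price moves from 100 to 105 (no cash, no trade)
reward_hold, value = step_reward(prev_value=1000.0, cash=0.0, shares=10,
                                 price=105.0, traded=False)
# Same price move, but the step involved a trade with a penalty of 1.0
reward_trade, _ = step_reward(prev_value=1000.0, cash=0.0, shares=10,
                              price=105.0, traded=True, trade_penalty=1.0)
print(reward_hold, reward_trade)
```

If percentage fees are already deducted from the cash balance when a trade executes, the extra `trade_penalty` would double-count costs and can stay at zero.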

## Implement the Custom Gym Environment

### Subtask:
Write a Python class that inherits from `gym.Env`. This class will implement the `__init__`, `step`, `reset`, and `render` methods according to the OpenAI Gym API. The `step` method will simulate trading actions based on the agent's output and update the environment state and reward.


**Reasoning**:
The next step is to define the `StockTradingEnv` class, inheriting from `gymnasium.Env`, and implement its `__init__`, `_get_observation`, `_get_current_price`, `_take_action`, `_get_reward`, `reset`, `step`, and `render` methods as per the instructions, which will create the custom Gym environment for stock trading.



In [9]:
import gymnasium as gym
import numpy as np
import pandas as pd

class StockTradingEnv(gym.Env):
    metadata = {'render_modes': ['human'], 'render_fps': 30}

    def __init__(self, df_scaled, original_df, observation_space, action_space, initial_balance=10000, trade_fee_pct=0.001):
        super().__init__()
        self.df_scaled = df_scaled
        self.original_df = original_df # Store original df to get actual prices

        self.observation_space = observation_space
        self.action_space = action_space

        self.initial_balance = initial_balance
        self.trade_fee_pct = trade_fee_pct

        self.balance = self.initial_balance
        self.shares_held = 0
        self.net_worth = self.initial_balance
        self.last_net_worth = self.initial_balance # For reward calculation
        self.current_step = 0
        self.max_steps = len(df_scaled) - 1

        print("StockTradingEnv initialized.")

    def _get_observation(self):
        return self.df_scaled.iloc[self.current_step].values.astype(np.float32)

    def _get_current_price(self):
        return self.original_df['Close'].iloc[self.current_step]

    def _take_action(self, action, stocks_count=None):
        # `stocks_count` caps the number of shares traded in this step.
        # None (the default used by `step`) means "as many as possible".
        current_price = self._get_current_price()

        if action == 1: # Buy
            # Spending `x` on shares costs `x * (1 + trade_fee_pct)` in total,
            # so the cash available for shares is `balance / (1 + trade_fee_pct)`.
            available_cash_for_shares = self.balance / (1 + self.trade_fee_pct)
            num_shares_to_buy = int(available_cash_for_shares // current_price)
            if stocks_count is not None:
                num_shares_to_buy = min(num_shares_to_buy, stocks_count)

            if num_shares_to_buy > 0: # Skip the trade if not even one share is affordable
                cost = num_shares_to_buy * current_price * (1 + self.trade_fee_pct)
                self.shares_held += num_shares_to_buy
                self.balance -= cost

        elif action == 2: # Sell
            num_shares_to_sell = self.shares_held
            if stocks_count is not None:
                num_shares_to_sell = min(num_shares_to_sell, stocks_count)

            if num_shares_to_sell > 0: # Skip the trade if there is nothing to sell
                proceeds = num_shares_to_sell * current_price * (1 - self.trade_fee_pct)
                self.balance += proceeds
                self.shares_held -= num_shares_to_sell

        # action == 0 (Hold): nothing to trade

        # Update net worth after action
        self.net_worth = self.balance + (self.shares_held * current_price)

    def _get_reward(self):
        reward = self.net_worth - self.last_net_worth
        self.last_net_worth = self.net_worth # Update for next step
        return reward

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.balance = self.initial_balance
        self.shares_held = 0
        self.net_worth = self.initial_balance
        self.last_net_worth = self.initial_balance
        self.current_step = 0

        observation = self._get_observation()
        info = {}
        return observation, info

    def step(self, action):
        self.last_net_worth = self.net_worth # Store net worth before action for reward calc
        self._take_action(action)

        reward = self._get_reward()

        self.current_step += 1
        terminated = self.current_step >= self.max_steps
        truncated = False # Not implementing truncation for now

        observation = self._get_observation() if not terminated else np.zeros_like(self._get_observation()) # Return dummy obs if terminated
        info = {}

        return observation, reward, terminated, truncated, info

    def render(self, mode='human'):
        if mode == 'human':
            print(f"Step: {self.current_step}")
            print(f"Balance: {self.balance:.2f}")
            print(f"Shares held: {self.shares_held:.2f}")
            print(f"Net Worth: {self.net_worth:.2f}")


**Reasoning**:
Now that the `StockTradingEnv` class is defined, the next logical step is to instantiate it and test its basic functionality by performing a few `reset()` and `step()` actions. This will confirm that the environment is correctly initialized and that actions can be taken, and rewards calculated.



In [10]:
env = StockTradingEnv(
    df_scaled=df_scaled,
    original_df=df_features,  # unscaled prices with missing values filled; the raw `df` still has a NaN in 'Close'
    observation_space=observation_space,
    action_space=action_space,
    initial_balance=10000,
    trade_fee_pct=0.001
)

# Test environment with a few steps
obs, info = env.reset()
print("\n--- Environment Reset ---")
env.render()

# Perform a 'buy' action (action=1)
print("\n--- Taking Action: Buy ---")
action = 1
obs, reward, terminated, truncated, info = env.step(action)
env.render()
print(f"Reward: {reward:.2f}, Terminated: {terminated}")

# Perform a 'hold' action (action=0)
print("\n--- Taking Action: Hold ---")
action = 0
obs, reward, terminated, truncated, info = env.step(action)
env.render()
print(f"Reward: {reward:.2f}, Terminated: {terminated}")

# Perform a 'sell' action (action=2)
print("\n--- Taking Action: Sell ---")
action = 2
obs, reward, terminated, truncated, info = env.step(action)
env.render()
print(f"Reward: {reward:.2f}, Terminated: {terminated}")

# Run through a few more steps to demonstrate progression
print("\n--- Running more steps ---")
for i in range(3):
    action = env.action_space.sample() # Take random actions
    obs, reward, terminated, truncated, info = env.step(action)
    env.render()
    print(f"Reward: {reward:.2f}, Terminated: {terminated}")
    if terminated:
        break

print("\nEnvironment testing complete.")

StockTradingEnv initialized.

--- Environment Reset ---
Step: 0
Balance: 10000.00
Shares held: 0.00
Net Worth: 10000.00

--- Taking Action: Buy ---
Step: 1
Balance: 52.85
Shares held: 99.00
Net Worth: 9990.06
Reward: -9.94, Terminated: False

--- Taking Action: Hold ---
Step: 2
Balance: 52.85
Shares held: 99.00
Net Worth: 10778.26
Reward: 788.19, Terminated: False

--- Taking Action: Sell ---
Step: 3
Balance: 10539.58
Shares held: 0.00
Net Worth: 10539.58
Reward: -238.67, Terminated: False

--- Running more steps ---
Step: 4
Balance: 84.56
Shares held: 96.00
Net Worth: 10529.14
Reward: -10.44, Terminated: False
Step: 5
Balance: 84.56
Shares held: 96.00
Net Worth: 9754.36
Reward: -774.78, Terminated: False
Step: 6
Balance: 84.56
Shares held: 96.00
Net Worth: 9956.83
Reward: 202.47, Terminated: False

Environment testing complete.
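
The numbers printed for the first buy can be checked by hand. The sketch below reproduces that arithmetic outside the environment, assuming the buy spends as much of the balance as possible on whole shares and using the first `Close` value from the sample output above (rounded to six decimals as printed):

```python
balance = 10000.0
trade_fee_pct = 0.001
price = 100.375870  # Close on 2023-01-01, as printed in the sample data

# Cash that can go into shares once the fee is accounted for
available_cash_for_shares = balance / (1 + trade_fee_pct)
shares = int(available_cash_for_shares // price)  # whole shares only

cost = shares * price * (1 + trade_fee_pct)       # share price plus fee
balance_after = balance - cost
net_worth = balance_after + shares * price
reward = net_worth - balance                      # change in net worth

print(f"Shares: {shares}, Balance: {balance_after:.2f}, "
      f"Net Worth: {net_worth:.2f}, Reward: {reward:.2f}")
```

The negative reward on the buy step is exactly the fee paid: 0.1% of the roughly 9,937 spent on shares.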


## Final Task

### Subtask:
Provide a summary of the created OpenAI Gym-compatible stock market environment, ready for integration with a PyTorch-based RL agent.


## Summary:

### Data Analysis Key Findings

*   **Data Preprocessing Completed**: Historical stock market data was successfully loaded from a CSV, missing values were handled using forward and backward fill, and relevant features ('Open', 'High', 'Low', 'Close', 'Volume') were normalized using `MinMaxScaler`, resulting in values scaled between 0 and 1.
*   **OpenAI Gym Interface Defined**:
    *   The **observation space** was defined as a `gym.spaces.Box` with a shape of `(5,)`, representing the five normalized features, with values ranging from 0.0 to 1.0.
    *   The **action space** was defined as `gym.spaces.Discrete(3)`, allowing for three distinct actions: 0 (hold), 1 (buy), and 2 (sell).
    *   A **conceptual reward function** was established, primarily based on the change in the agent's portfolio value, accounting for profits/losses from trades and transaction costs.
*   **Custom Stock Trading Environment Implemented**: A `StockTradingEnv` class was successfully created, inheriting from `gymnasium.Env`, fulfilling the OpenAI Gym API requirements:
    *   The `__init__` method initializes the environment with an `initial_balance` (e.g., \$10,000) and a `trade_fee_pct` (e.g., 0.1%).
    *   The `reset` method reinitializes the environment's state variables (balance, shares held, net worth) to their starting values.
    *   The `step` method correctly processes actions, updates the environment's state (balance, shares held), calculates the current `net_worth` and `reward` based on its change, and advances the simulation by one time step, accounting for `trade_fee_pct` during buy/sell operations.
    *   The `render` method provides human-readable output of the current step, balance, shares held, and net worth.
*   **Environment Functionality Smoke-Tested**: A short test of six steps (buy, hold, sell, then three random actions) ran without errors, and the reported balance, shares held, net worth, and rewards were consistent with the trading logic. The test did not run an episode through to termination.

### Insights or Next Steps

*   The created `StockTradingEnv` is a working starting point for training a PyTorch-based Reinforcement Learning agent, although it has so far been exercised only on a ten-row synthetic dataset. The next step is to integrate this environment with a suitable RL algorithm (e.g., A2C, PPO, DQN) implemented in PyTorch to begin the training process.
*   To enhance the environment's realism and the RL agent's learning capabilities, consider expanding the observation space with additional technical indicators (e.g., Moving Averages, RSI, MACD), implementing more sophisticated transaction costs, or incorporating slippage models for larger trades.
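
As an illustration of the first suggestion, a simple moving average could be computed as an extra feature. This is a minimal sketch using NumPy only; the function name `simple_moving_average`, the window length, and the toy prices are assumptions made for the example:

```python
import numpy as np

def simple_moving_average(prices, window):
    """Mean of each run of `window` consecutive prices.

    The result has len(prices) - window + 1 entries, because the first
    window - 1 positions do not have enough history.
    """
    prices = np.asarray(prices, dtype=float)
    if window < 1 or window > len(prices):
        raise ValueError("window must be between 1 and len(prices)")
    kernel = np.ones(window) / window
    return np.convolve(prices, kernel, mode="valid")

closes = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical closing prices
sma_3 = simple_moving_average(closes, window=3)
print(sma_3)
```

Before such a column is added to the observation, the first `window - 1` rows would need to be dropped or padded, and the shape of the observation space would need to grow by one.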
