# Stock Market Data Cleaning Notebook
This notebook walks through loading, cleaning, and normalizing stock market data. It converts columns to numeric types, fills missing values with column means, and rescales every numeric column to the [0, 1] range so the data is ready for further analysis.

## 1. Import Libraries
In this section, we'll import the necessary Python libraries: `pandas` for data manipulation and scikit-learn's `MinMaxScaler` for data normalization.

In [1]:
# Import required libraries
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

## 2. Load the Data
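Here we load the stock market data from a CSV file. The file has the columns "price", "close", "high", "low", "open", and "volume", plus a leading column that pandas labels `Unnamed: 0` because its header is empty (typically a row index saved along with the data). Note that row 2 has no value for "price".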

In [2]:
# Load data
data = pd.read_csv("C:/Projects/Simplified_Stock_Market_Predictor_AI/Sandbox/test_stock_data.csv")
data.head()  # Display the first few rows to verify the data

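| Unnamed: 0 | price | close | high | low | open | volume |
|---|---|---|---|---|---|---|
| 0 | 100.0 | 105 | 110 | 98 | 102 | 5000 |
| 1 | 102.0 | 107 | 112 | 99 | 104 | 4500 |
| 2 | NaN | 106 | 109 | 97 | 101 | 4700 |
| 3 | 98.0 | 103 | 108 | 95 | 97 | 4600 |
| 4 | 105.0 | 108 | 113 | 101 | 106 | 5200 |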
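To experiment with the load step without the original file, the preview rows can be rebuilt in memory. The sketch below is an illustration only: `SAMPLE_CSV`, `raw`, and `indexed` are hypothetical names, and the sample contains just the five rows shown above, not the full file. It shows where the `Unnamed: 0` column comes from and how to count missing values per column.

```python
import io

import pandas as pd

# The five preview rows shown above, rebuilt as in-memory CSV text.
# The empty first header field is what pandas labels "Unnamed: 0".
SAMPLE_CSV = """,price,close,high,low,open,volume
0,100.0,105,110,98,102,5000
1,102.0,107,112,99,104,4500
2,,106,109,97,101,4700
3,98.0,103,108,95,97,4600
4,105.0,108,113,101,106,5200
"""

# Default read: the leading column becomes a regular "Unnamed: 0" column.
raw = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Alternative read: treat the leading column as the row index instead.
indexed = pd.read_csv(io.StringIO(SAMPLE_CSV), index_col=0)

# Count missing values per column to see what the cleaning step must handle.
missing_per_column = raw.isna().sum()
print(missing_per_column)
print(list(indexed.columns))
```

Reading with `index_col=0` is one option if the leading column should not be treated as a feature later on. Whether that is appropriate depends on what the column means in the real file.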


## 3. Clean the Data
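This section handles the cleaning process:
1. Convert the relevant columns to numeric values, coercing unparseable entries to NaN.
2. Fill missing values with the column mean, first in the "price" column and then in any remaining numeric columns.
3. Normalize the numeric columns to the [0, 1] range using Min-Max scaling.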
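The three steps above can be traced on a single column. The sketch below uses a hypothetical `sample` frame holding the previewed "price" values as strings, with the missing entry written as `"n/a"` (an assumption made for illustration). The scaling is written out by hand as `(x - min) / (max - min)`, which is the formula `MinMaxScaler` applies with its default `feature_range=(0, 1)`, so that the arithmetic is visible.

```python
import pandas as pd

# Hypothetical sample: the "price" values from the preview, stored as strings
# with one unparseable entry, to mimic a column that needs conversion.
sample = pd.DataFrame({"price": ["100.0", "102.0", "n/a", "98.0", "105.0"]})

# Step 1: convert to numeric; the unparseable entry becomes NaN.
sample["price"] = pd.to_numeric(sample["price"], errors="coerce")

# Step 2: fill missing values with the mean of the observed values.
price_mean = sample["price"].mean()  # NaN is skipped by default
sample["price"] = sample["price"].fillna(price_mean)

# Step 3: Min-Max scaling, written out by hand: (x - min) / (max - min).
col_min = sample["price"].min()
col_max = sample["price"].max()
sample["price_scaled"] = (sample["price"] - col_min) / (col_max - col_min)

print(sample)
```

Here the mean of the four observed prices is 101.25, so the filled value scales to (101.25 - 98) / (105 - 98), roughly 0.464, while the smallest and largest prices map to 0 and 1.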

In [11]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def clean_data(input_file, output_file):
    """
    Cleans the stock market data by handling missing values and normalizing the data.

    Args:
        input_file (str): Path to the input CSV file.
        output_file (str): Path to the output CSV file.
    """
    try:
        # Load data
        data = pd.read_csv(input_file)
        print("Data loaded successfully.")
        
        # Convert relevant columns to numeric, coercing errors to NaN
        cols = ["price", "close", "high", "low", "open", "volume"]
        data[cols] = data[cols].apply(pd.to_numeric, errors="coerce")
        print("Converted relevant columns to numeric.")
        
        # Handle NaN values in the price column specifically
        if data["price"].isna().sum() > 0:
            print("Handling NaN values in the price column.")
            data["price"] = data["price"].fillna(data["price"].mean())

        # Select numeric columns
        numeric_data = data.select_dtypes(include="number")
        print("Selected numeric columns.")
        
        # Fill missing values with the mean for other columns
        numeric_data = numeric_data.fillna(numeric_data.mean())
        print("Missing values filled.")
        
        # Normalize data using Min-Max Scaling
        scaler = MinMaxScaler()
        data_scaled = pd.DataFrame(scaler.fit_transform(numeric_data), columns=numeric_data.columns)
        print("Data normalized.")
        
        # Save cleaned data to the specific path
        data_scaled.to_csv(output_file, index=False)
        print(f"Cleaned data saved to {output_file}")

    except Exception as e:
        # Report the failure instead of raising; the f-string formats the exception as text
        print(f"Error cleaning data: {e}")

# Specify the input and output file paths
input_file = "C:/Projects/Simplified_Stock_Market_Predictor_AI/Sandbox/test_stock_data.csv"
output_file = "C:/Projects/Simplified_Stock_Market_Predictor_AI/Sandbox/cleaned_test_stock_data.csv"

# Clean the data
clean_data(input_file, output_file)

Data loaded successfully.
Converted relevant columns to numeric.
Handling NaN values in the price column.
Selected numeric columns.
Missing values filled.
Data normalized.
Cleaned data saved to C:/Projects/Simplified_Stock_Market_Predictor_AI/Sandbox/cleaned_test_stock_data.csv


Note: pandas emits the `FutureWarning` below when the price column is filled in place with `data["price"].fillna(data["price"].mean(), inplace=True)`. The `clean_data` cell above assigns the result back instead, which does not trigger it.

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
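The warning text suggests two replacements for the chained inplace call. The sketch below compares them on a small hypothetical frame (`df`, `by_assignment`, and `by_mapping` are illustrative names, not part of the notebook). Both produce the same filled column.

```python
import pandas as pd

# Hypothetical frame with one missing price, mirroring the preview data.
df = pd.DataFrame({
    "price": [100.0, 102.0, None, 98.0, 105.0],
    "close": [105, 107, 106, 103, 108],
})

price_mean = df["price"].mean()

# Option 1: assign the filled column back to the frame.
by_assignment = df.copy()
by_assignment["price"] = by_assignment["price"].fillna(price_mean)

# Option 2: call fillna on the frame with a {column: value} mapping.
by_mapping = df.fillna({"price": price_mean})

print(by_assignment["price"].tolist())
print(by_mapping["price"].tolist())
```

Option 1 is the pattern `clean_data` uses. Option 2 can be convenient when several columns need different fill values, since one mapping can cover all of them.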


## 4. Verify the Data
After cleaning the data, we should verify that the result looks as expected. Let's load the cleaned file written by `clean_data` and display the first few rows. Every column should now lie between 0 and 1 and contain no missing values.

In [10]:
# Verify the cleaned data
cleaned_data = pd.read_csv("C:/Projects/Simplified_Stock_Market_Predictor_AI/Sandbox/cleaned_test_stock_data.csv")
cleaned_data.head()  # Display the first few rows of the cleaned data
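Beyond inspecting rows by eye, the properties that the cleaning step is meant to guarantee can be checked programmatically. The helper below is a minimal sketch, not part of the original notebook: `check_cleaned` and its `tol` parameter are hypothetical, and the two small frames stand in for a cleaned file.

```python
import pandas as pd


def check_cleaned(df, tol=1e-9):
    """Return a list of problems found in a cleaned, Min-Max-scaled frame.

    Hypothetical helper: an empty list means every check passed.
    """
    problems = []

    # 1. No missing values should survive the mean fill.
    if df.isna().any().any():
        problems.append("missing values remain")

    # 2. Only numeric columns should be present in the cleaned output.
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")

    # 3. Scaled values should fall in [0, 1], allowing for float rounding.
    numeric = df.select_dtypes(include="number")
    if not numeric.empty:
        if numeric.min().min() < -tol or numeric.max().max() > 1 + tol:
            problems.append("values outside [0, 1]")

    return problems


# Stand-in frames: one that satisfies the checks and one that does not.
good = pd.DataFrame({"price": [0.0, 0.5, 1.0], "volume": [1.0, 0.0, 0.4]})
bad = pd.DataFrame({"price": [0.0, None, 2.0]})

print(check_cleaned(good))
print(check_cleaned(bad))
```

In the notebook, the same helper could be applied to the frame read back from the cleaned CSV path.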


## 5. Conclusion
In this notebook, we've:
1. Loaded stock market data from a CSV file and converted its columns to numeric types.
2. Handled missing values by filling them with the column mean.
3. Normalized the numeric columns to the [0, 1] range to prepare the data for further analysis.