# STAGE 4 - Predicting Stock Close Price Using Machine Learning and Time Series Models

**Import Libraries**


- **`pymysql`**: Connects and interacts with the MySQL database.
- **`pandas`**: Handles data manipulation and analysis, particularly with time-series data.
- **`numpy`**: Performs numerical computations and operations on data.
- **`sklearn.model_selection.train_test_split`**: Splits data into training and testing sets for machine learning models.
- **`sklearn.linear_model.LinearRegression`**: Implements linear regression for predicting stock prices.
- **`sklearn.metrics.mean_absolute_error`**: Evaluates model performance by calculating Mean Absolute Error.
- **`sklearn.metrics.mean_squared_error`**: Evaluates model performance by calculating Mean Squared Error.
- **`xgboost.XGBRegressor`**: Implements XGBoost for regression tasks, particularly for stock price prediction.
- **`statsmodels.tsa.arima.model.ARIMA`**: Implements ARIMA model for time-series forecasting.
- **`sklearn.ensemble.RandomForestRegressor`**: Implements Random Forest for regression tasks to predict stock prices.
- **`datetime`**: Handles date and time operations for calculating date ranges.
- **`timedelta`**: Represents durations of time, used for date calculations.
- **`warnings`**: Suppresses selected warnings to keep the notebook output clean.

In [1]:
import pymysql  # MySQL connection
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from xgboost import XGBRegressor
from statsmodels.tsa.arima.model import ARIMA
from sklearn.ensemble import RandomForestRegressor
from datetime import datetime, timedelta
import warnings

In [2]:
# Suppress warnings
warnings.filterwarnings("ignore", category=UserWarning)

**Connect to MySQL and Load Data**

- **Purpose**: This function connects to a MySQL database, fetches a given company's stock data for the last `days` days, and cleans it.
- **Process**:
  1. Establishes a connection to MySQL.
  2. Queries the specified table for rows dated within the last `days` days.
  3. Removes commas from numeric columns and converts them to numeric format.
  4. Returns the cleaned stock data.
- **Error Handling**: Catches MySQL errors if the connection or query fails.
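
The cleaning in step 3 can be tried in isolation. The sketch below applies comma stripping and numeric coercion to a small hand-made DataFrame. The sample values are invented for illustration, and it uses `Series.str.replace` rather than the `replace(..., regex=True)` call in `load_data`; both are meant to remove thousands separators.

```python
import pandas as pd

# Hypothetical sample rows; real rows come from the MySQL query.
raw = pd.DataFrame({
    "OPEN": ["1,234.50", "987.00"],
    "VOLUME": ["12,34,567", "n/a"],  # "n/a" stands in for an unparseable value
})

cleaned = raw.copy()
for col in ["OPEN", "VOLUME"]:
    # Remove comma separators, then convert; unparseable values become NaN.
    cleaned[col] = cleaned[col].str.replace(",", "", regex=False)
    cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")

print(cleaned)
```

Values that fail to parse are coerced to `NaN` instead of raising an error, so they can be dropped in a later step.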

In [3]:
# Step 1: Connect to MySQL and load the data
def load_data(table_name, days=1000):
    try:
        # Connect to MySQL using pymysql
        conn = pymysql.connect(host="localhost", user="root", password="Onmyway09@", database="STOCK_PREDICTION")
        
        # Fetch last `days` of stock data
        date_from = (datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d")
        
        # Query for stock data
        stock_query = f"SELECT `Date`, `OPEN`, `HIGH`, `LOW`, `PREV. CLOSE`, `close`, `VOLUME` FROM {table_name} WHERE Date >= '{date_from}'"
        stock_data = pd.read_sql(stock_query, conn)
        
        # Close the connection
        conn.close()

        # Handle columns that may contain commas in numeric data
        numeric_columns = ['OPEN', 'HIGH', 'LOW', 'PREV. CLOSE', 'close', 'VOLUME']
        for col in numeric_columns:
            if col in stock_data.columns:
                stock_data[col] = stock_data[col].replace({',': ''}, regex=True)  # Remove commas
                stock_data[col] = pd.to_numeric(stock_data[col], errors='coerce')  # Convert to numeric (with coercion for errors)

        # Return the cleaned data
        return stock_data
    except pymysql.MySQLError as err:
        print(f"An error occurred while connecting to MySQL: {err}")
        return None
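
`load_data` builds its SQL by interpolating both the table name and the date into an f-string, and the table name comes straight from user input. One possible hardening is sketched below, under stated assumptions: `build_stock_query` is a hypothetical helper that is not part of the notebook, the allowed table-name pattern is an assumption, and the `now` argument exists only to make the example reproducible.

```python
import re
from datetime import datetime, timedelta

def build_stock_query(table_name, days=1000, now=None):
    """Return (sql, params) for a hypothetical safer version of the query."""
    # Table names cannot be bound as parameters, so validate them explicitly.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"Invalid table name: {table_name!r}")
    now = now or datetime.now()
    date_from = (now - timedelta(days=days)).strftime("%Y-%m-%d")
    # The date is passed as a bound parameter (%s) instead of being interpolated.
    sql = (
        "SELECT `Date`, `OPEN`, `HIGH`, `LOW`, `PREV. CLOSE`, `close`, `VOLUME` "
        f"FROM `{table_name}` WHERE `Date` >= %s"
    )
    return sql, (date_from,)

sql, params = build_stock_query("indian_hotels", days=10, now=datetime(2024, 11, 26))
print(sql)
print(params)
```

The returned pair could then be passed on as `pd.read_sql(sql, conn, params=params)`, assuming the `pymysql` connection used above, which accepts `%s` placeholders.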

**Feature Engineering**

- **Purpose**: Prepares the stock data by setting a date index and creating new features for modeling.
- **Process**:
  1. Converts the 'Date' column to datetime and sets it as the index.
  2. Sorts the data by date and assigns a daily frequency to the index.
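  3. Copies the 'PREV. CLOSE' column into a new 'Prev Close' column for easier use (the original column is kept).
  4. Drops rows with missing values, including the empty rows that the daily reindexing creates for non-trading days.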
- **Output**: Returns the cleaned and processed data with the necessary features for modeling.

In [4]:
# Step 2: Feature Engineering
def feature_engineering(data):
    data['Date'] = pd.to_datetime(data['Date'])
    data.set_index('Date', inplace=True)
    data.sort_values('Date', inplace=True)

    # Assign a frequency to the index (daily frequency)
    if not data.index.freq:
        data = data.asfreq('D')

    data['Prev Close'] = data['PREV. CLOSE']  # Copy under a simpler name
    data.dropna(inplace=True)  # Drop any rows with missing values
    return data
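
The interaction between `asfreq('D')` and `dropna` is easy to miss, so here is a minimal sketch on invented prices for a Friday, Monday, and Tuesday. The dates and values are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical closing prices for Fri, Mon, Tue; the weekend has no rows.
data = pd.DataFrame(
    {"close": [100.0, 101.5, 102.0]},
    index=pd.to_datetime(["2024-11-22", "2024-11-25", "2024-11-26"]),
)

daily = data.asfreq("D")        # inserts NaN rows for 2024-11-23 and 2024-11-24
trading_only = daily.dropna()   # removes those rows again

print(len(daily), len(trading_only))
print(daily)
```

Reindexing to daily frequency adds one all-`NaN` row per missing calendar day, and `dropna` then removes them, so the returned frame contains trading days only.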

**Prepare the Data for Modeling**

- **Purpose**: Splits the data into features (X) and target variable (y) for machine learning models.
- **Process**:
  1. Selects relevant features: 'OPEN', 'HIGH', 'LOW', 'VOLUME', and 'Prev Close'.
  2. Defines 'close' as the target variable (y).
- **Output**: Returns two variables: `X` (features) and `y` (target).

In [5]:
# Step 3: Prepare the data for modeling
def prepare_data(data):
    features = ['OPEN', 'HIGH', 'LOW', 'VOLUME', 'Prev Close']
    X = data[features]
    y = data['close']  # Target variable
    return X, y
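
Once `X` and `y` are separated, they are split into training and test sets. The `main` function further down uses a shuffled 80/20 split. For time-ordered data, a chronological split is a common alternative. The sketch below shows that variant with `shuffle=False` on a synthetic frame; the random values are placeholders, not stock data, and this is not the split the notebook itself uses.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with the same column names as the processed stock data.
columns = ["OPEN", "HIGH", "LOW", "VOLUME", "Prev Close", "close"]
dates = pd.date_range("2024-01-01", periods=10, freq="D")
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.uniform(90, 110, size=(10, 6)), columns=columns, index=dates)

X = data[["OPEN", "HIGH", "LOW", "VOLUME", "Prev Close"]]
y = data["close"]

# shuffle=False keeps row order: the first 80% train, the last 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(X_train.index.max(), "<", X_test.index.min())
```

With this variant every test row is later in time than every training row, so the test error reflects prediction on unseen future dates.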

**Linear Regression Model**

- **Purpose**: Fits a Linear Regression model to the training data and evaluates its performance.
- **Process**:
  1. Initializes and fits a Linear Regression model to the training data (`X_train`, `y_train`).
  2. Predicts the target values (`y_pred`) using the test data (`X_test`).
  3. Calculates the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to evaluate the model’s performance.
- **Output**: Returns the trained Linear Regression model and prints the MAE and RMSE metrics.

In [6]:
# Step 4: Linear Regression Model
def linear_regression_model(X_train, y_train, X_test, y_test):
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"Linear Regression MAE: {mae}, RMSE: {rmse}")
    return model
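
For reference, both metrics can be computed by hand. The sketch below does so in plain NumPy on three invented values, which makes it easy to check what `mean_absolute_error` and the square root of `mean_squared_error` measure.

```python
import numpy as np

# Hypothetical true and predicted closing prices.
y_true = np.array([100.0, 102.0, 104.0])
y_pred = np.array([101.0, 101.0, 107.0])

errors = y_pred - y_true                # [1, -1, 3]
mae = np.mean(np.abs(errors))           # (1 + 1 + 3) / 3
rmse = np.sqrt(np.mean(errors ** 2))    # sqrt((1 + 1 + 9) / 3)

print(f"MAE: {mae:.4f}, RMSE: {rmse:.4f}")  # MAE: 1.6667, RMSE: 1.9149
```

RMSE is never smaller than MAE and grows faster when a few errors are large, which is why the two are reported together.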

**XGBoost Model**

- **Purpose**: Fits an XGBoost regression model to the training data and evaluates its performance.
- **Process**:
  1. Initializes and trains the XGBoost model with 500 estimators and a fixed random state.
  2. Predicts the target values (`y_pred`) using the test data (`X_test`).
  3. Calculates the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to evaluate the model’s accuracy.
- **Output**: Returns the trained XGBoost model and prints the MAE and RMSE metrics.

In [7]:
# Step 5: XGBoost Model
def xgboost_model(X_train, y_train, X_test, y_test):
    model = XGBRegressor(n_estimators=500, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"XGBoost MAE: {mae}, RMSE: {rmse}")
    return model

**Time Series Forecasting using ARIMA**

- **Purpose**: Uses the ARIMA model to forecast the stock’s closing price.
- **Process**:
  1. Ensures the data is sorted by date and the datetime index is properly set to daily frequency.
  2. Selects the `close` price for forecasting.
  3. Fits an ARIMA model with order (5, 1, 0) to the closing prices.
  4. Forecasts the next day’s closing price using the trained model.
- **Output**: Returns the predicted closing price for the next day and prints the forecast.

In [8]:
# Step 6: Time Series Forecasting using ARIMA
def time_series_forecasting(data):
    try:
        # Ensure the data is sorted by date
        data = data.sort_index()

        # Set the frequency of the datetime index
        data.index = pd.to_datetime(data.index)
        data = data.asfreq('D')

        close_prices = data['close']
        model = ARIMA(close_prices, order=(5, 1, 0))
        model_fit = model.fit()
        
        forecast = model_fit.forecast(steps=1)
        print("ARIMA Forecast:", forecast)
        
        return forecast.iloc[0]
    except Exception as e:
        print(f"ARIMA Error: {e}")
        return None
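
To make the order `(5, 1, 0)` concrete: `d = 1` means the model works on first differences of the closing price, `p = 5` means each difference is regressed on the previous five differences, and `q = 0` means there are no moving-average terms. The sketch below reproduces that structure with ordinary least squares in NumPy. It is an illustration only: `statsmodels` fits ARIMA with a different estimation procedure, so its numbers will generally not match, and the helper name `ar_on_differences_forecast` is hypothetical.

```python
import numpy as np

def ar_on_differences_forecast(prices, p=5):
    """One-step forecast from an AR(p) model fitted to first differences."""
    prices = np.asarray(prices, dtype=float)
    diffs = np.diff(prices)                  # d = 1: work on step-to-step changes
    if len(diffs) <= p:
        raise ValueError("Need at least p + 2 prices")
    # Each row holds the p previous changes, most recent first.
    lagged = np.array([diffs[t - p:t][::-1] for t in range(p, len(diffs))])
    target = diffs[p:]
    coefs, *_ = np.linalg.lstsq(lagged, target, rcond=None)
    next_change = diffs[-p:][::-1] @ coefs   # predicted next difference
    return prices[-1] + next_change          # undo the differencing

# Hypothetical series that rises by exactly 2 per step (last value is 158).
prices = 100 + 2 * np.arange(30)
forecast = ar_on_differences_forecast(prices)
print(round(forecast, 6))
```

On this perfectly linear series every change equals 2, so the fitted model predicts another change of 2 and the forecast lands at about 160.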

**Random Forest Model**

- **Purpose**: Applies the Random Forest Regressor to predict the stock’s closing price.
- **Process**:
  1. Initializes the RandomForestRegressor with 500 estimators and a fixed random seed.
  2. Fits the model to the training data (`X_train`, `y_train`).
  3. Predicts the closing price on the test data (`X_test`).
  4. Evaluates the model using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
- **Output**: Prints MAE and RMSE values to assess model performance and returns the trained model.

In [9]:
# Step 7: Random Forest Model
def random_forest_model(X_train, y_train, X_test, y_test):
    model = RandomForestRegressor(n_estimators=500, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"Random Forest MAE: {mae}, RMSE: {rmse}")
    return model
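
The Linear Regression, XGBoost, and Random Forest functions share the same fit, predict, and score pattern. A generic helper could express that once. The sketch below is one possible refactoring, not part of the notebook: `fit_and_score` is a hypothetical name, and the data is a synthetic straight line chosen so the result is easy to verify.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

def fit_and_score(model, X_train, y_train, X_test, y_test, name):
    """Fit any scikit-learn style regressor and report MAE and RMSE."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"{name} MAE: {mae}, RMSE: {rmse}")
    return model, mae, rmse

# Synthetic data on the exact line y = 2x + 1.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1

model, mae, rmse = fit_and_score(
    LinearRegression(), X[:15], y[:15], X[15:], y[15:], "Linear Regression"
)
```

Any estimator with `fit` and `predict` methods, including `XGBRegressor` and `RandomForestRegressor`, could be passed in the same way. Returning the metrics also makes it possible to compare models programmatically rather than by reading printed output.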

**Main Function**

The `main` function lets users predict a company's closing price with several models.

1. **User Input**:
   - Enter a company name (used as the table name) and the number of days of data to load (default: 1000).

2. **Data Loading**:
   - Loads stock data for the company using the `load_data` function.

3. **Feature Engineering**:
   - Prepares the data by cleaning and sorting it.

4. **Model Predictions**:
   - Splits the data 80/20 into training and test sets for the regression models, then runs predictions using:
     - **Linear Regression**
     - **XGBoost**
     - **Random Forest**
     - **ARIMA (Time Series)**

5. **Results Display**:
   - Shows predictions from each model.

6. **Error Handling**:
   - Catches and prints any unexpected error. Because the `try` block wraps the whole loop, such an error also ends the session.

The function loops until the user types "stop" or "exit".
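
The handling of the "number of days" prompt can be isolated into a small function. The sketch below mirrors the fallback behavior described above; `parse_days` is a hypothetical helper, not a function defined in the notebook.

```python
def parse_days(raw, default=1000):
    """Convert the user's text to an int, falling back to the default."""
    raw = raw.strip()
    if not raw:
        return default          # empty input: use the default
    try:
        return int(raw)
    except ValueError:
        return default          # non-numeric input: use the default

print(parse_days(""))                # 1000
print(parse_days(" 250 "))           # 250
print(parse_days("Tata_Chemicals"))  # 1000
```

Like the original code, this sketch does not reject zero or negative values; a stricter version might add that check.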

In [None]:
def main():
    try:
        while True:
            # Get user input for table name
            table_name = input("Enter the Company Name (or type 'stop' or 'exit' to end):").strip()
            
            # Check for stop/exit condition
            if table_name.lower() in ["stop", "exit"]:
                print("\n \033[1mExiting the program. Goodbye!\033[0m \n")
                break

            days_input = input("Enter the Number of Days for Data Analysis (default: 1000):").strip()
            
            try:
                days = int(days_input) if days_input else 1000
            except ValueError:
                print("\033[1mInvalid input for days. Defaulting to 1000.\033[0m")
                days = 1000
            
            print(f"\nSelected Company: {table_name}")
            print(f"Fetching data for the last {days} days...\n")

            # Load data
            data = load_data(table_name, days)

            if data is None or data.empty:
                print("\033[1mNo data found for the selected table. Please try again.\n\033[0m")
                continue

            # Feature Engineering
            data = feature_engineering(data)

            # Prepare data
            X, y = prepare_data(data)
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

            # Results dictionary to store model outputs
            results = {}

            # Linear Regression
            print("\n--- \033[1mLinear Regression\033[0m ---")
            lr_model = linear_regression_model(X_train, y_train, X_test, y_test)
            lr_pred = lr_model.predict(X.iloc[[-1]])  # Last row as a one-row DataFrame (keeps feature names)
            results["Linear Regression"] = {"Prediction": lr_pred[0]}

            # XGBoost
            print("\n--- \033[1mXGBoost\033[0m ---")
            xgb_model = xgboost_model(X_train, y_train, X_test, y_test)
            xgb_pred = xgb_model.predict(X.iloc[[-1]])  # Last row as a one-row DataFrame (keeps feature names)
            results["XGBoost"] = {"Prediction": xgb_pred[0]}

            # Random Forest
            print("\n--- \033[1mRandom Forest\033[0m ---")
            rf_model = random_forest_model(X_train, y_train, X_test, y_test)
            rf_pred = rf_model.predict(X.iloc[[-1]])  # Last row as a one-row DataFrame (keeps feature names)
            results["Random Forest"] = {"Prediction": rf_pred[0]}

            # ARIMA Time Series Forecasting
            print("\n--- \033[1mARIMA Time Series Forecasting\033[0m ---")
            arima_forecast = time_series_forecasting(data)
            results["ARIMA"] = {"Prediction": arima_forecast}

            # Displaying all model predictions
            print("\n--- \033[1mModel Predictions\033[0m ---")
            for model_name, result in results.items():
                print(f"{model_name}: {result['Prediction']}")

            print("\n--- \033[1mEnd of Predictions\033[0m ---\n\n")

    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    main()

Enter the Company Name (or type 'stop' or 'exit' to end): indian_hotels
Enter the Number of Days for Data Analysis (default: 1000): 



Selected Company: indian_hotels
Fetching data for the last 1000 days...


--- Linear Regression ---
Linear Regression MAE: 1.9002639686690457, RMSE: 2.6255859793311362

--- XGBoost ---
XGBoost MAE: 4.178241954130286, RMSE: 6.410708860204474

--- Random Forest ---
Random Forest MAE: 3.727747794117566, RMSE: 5.977995303703566

--- ARIMA Time Series Forecasting ---
ARIMA Forecast: 2024-11-27    798.14264
Freq: D, dtype: float64

--- Model Predictions ---
Linear Regression: 794.0386964721565
XGBoost: 796.74853515625
Random Forest: 793.8818999999982
ARIMA: 798.1426399032409

--- End of Predictions ---




Enter the Company Name (or type 'stop' or 'exit' to end): Tata_Motors
Enter the Number of Days for Data Analysis (default: 1000): 



Selected Company: Tata_Motors
Fetching data for the last 1000 days...


--- Linear Regression ---
Linear Regression MAE: 3.2401048517287863, RMSE: 4.613972673397773

--- XGBoost ---
XGBoost MAE: 5.785230838551244, RMSE: 9.291524365670181

--- Random Forest ---
Random Forest MAE: 4.731217647058826, RMSE: 6.925359693355535

--- ARIMA Time Series Forecasting ---
ARIMA Forecast: 2024-11-27    783.338802
Freq: D, dtype: float64

--- Model Predictions ---
Linear Regression: 788.0231049951979
XGBoost: 782.999267578125
Random Forest: 784.9977000000007
ARIMA: 783.3388019891581

--- End of Predictions ---




Enter the Company Name (or type 'stop' or 'exit' to end): Tata_steel
Enter the Number of Days for Data Analysis (default: 1000): Tata_Chemicals


Invalid input for days. Defaulting to 1000.

Selected Company: Tata_steel
Fetching data for the last 1000 days...


--- Linear Regression ---
Linear Regression MAE: 1.9430208026327642, RMSE: 4.058470854326655

--- XGBoost ---
XGBoost MAE: 11.798353181726794, RMSE: 54.30519551632143

--- Random Forest ---
Random Forest MAE: 5.035686029411819, RMSE: 17.58313795583704

--- ARIMA Time Series Forecasting ---
ARIMA Forecast: 2024-11-27    144.161193
Freq: D, dtype: float64

--- Model Predictions ---
Linear Regression: 145.0021113570471
XGBoost: 144.46974182128906
Random Forest: 144.53426000000047
ARIMA: 144.16119317524937

--- End of Predictions ---




Enter the Company Name (or type 'stop' or 'exit' to end): voltas
Enter the Number of Days for Data Analysis (default: 1000): 



Selected Company: voltas
Fetching data for the last 1000 days...


--- Linear Regression ---
Linear Regression MAE: 6.027376691339064, RMSE: 9.619217711890059

--- XGBoost ---
XGBoost MAE: 9.466217400045954, RMSE: 13.18784807600276

--- Random Forest ---
Random Forest MAE: 8.24087867647095, RMSE: 11.338401164584436

--- ARIMA Time Series Forecasting ---
ARIMA Forecast: 2024-11-27    1670.364357
Freq: D, dtype: float64

--- Model Predictions ---
Linear Regression: 1658.5594662924043
XGBoost: 1674.099853515625
Random Forest: 1670.80259999999
ARIMA: 1670.364357022422

--- End of Predictions ---




Enter the Company Name (or type 'stop' or 'exit' to end): trent
Enter the Number of Days for Data Analysis (default: 1000): 



Selected Company: trent
Fetching data for the last 1000 days...


--- Linear Regression ---
Linear Regression MAE: 16.062283055791937, RMSE: 28.83702794673392

--- XGBoost ---
XGBoost MAE: 35.628348137350656, RMSE: 79.60699842178154

--- Random Forest ---
Random Forest MAE: 26.171921323528284, RMSE: 52.15667661187141

--- ARIMA Time Series Forecasting ---
ARIMA Forecast: 2024-11-27    6655.9954
Freq: D, dtype: float64

--- Model Predictions ---
Linear Regression: 6702.552448990584
XGBoost: 6669.34912109375
Random Forest: 6715.754300000038
ARIMA: 6655.995400487803

--- End of Predictions ---




Enter the Company Name (or type 'stop' or 'exit' to end): titan
Enter the Number of Days for Data Analysis (default: 1000): 



Selected Company: titan
Fetching data for the last 1000 days...


--- Linear Regression ---
Linear Regression MAE: 12.711164503077686, RMSE: 19.149498495780847

--- XGBoost ---
XGBoost MAE: 20.564092658547786, RMSE: 26.875245127677246

--- Random Forest ---
