# *Predicting Stock Returns Using Machine Learning Models*  

---

## Introduction  
This project explores **predicting stock price returns** for all **S&P 500 stocks** using machine learning techniques. By leveraging historical data, this notebook evaluates the effectiveness of various classification models in guiding profitable trading strategies.  

## Key Objectives:  
1. **Model Comparison**:  
   - Test and compare the performance of multiple machine learning models including:  
     - Multi-layer Perceptron Classifier (MLPClassifier)  
     - Random Forest Classifier (RandomForestClassifier)  
     - Gradient Boosting Classifier (HistGradientBoostingClassifier)  
     - Logistic Regression (LogisticRegression)  

2. **Performance Insights**:  
   - Measure model accuracy and precision across all 500 stocks.  
   - Identify trends and highlight top-performing models.  

3. **Optimization and Reporting**:  
   - Select top-performing models for parameter tuning.  
   - Generate detailed reports for the best-performing stocks.

## Significance:  
By integrating machine learning into financial analysis, this project demonstrates how data-driven decision-making can enhance trading strategies, minimize risk, and maximize returns.

---  

In [4]:
import os
import pandas as pd
import yfinance as yf

In [None]:
def process_tickers():
    tickers_nasd = pd.read_csv('./tickers/tickers_nasd.csv')
    tickers_nyse = pd.read_csv('./tickers/tickers_nyse.csv')
    tickers = pd.read_csv('./tickers/tickers.csv')

    t1 = tickers_nasd['Symbol']
    t2 = tickers_nyse['Symbol']
    t3 = tickers['Ticker']

    # Combine into one DataFrame
    df = pd.DataFrame({'ticker': pd.concat([t1, t2, t3], ignore_index=True)})

    # Remove duplicates
    df = df.drop_duplicates().reset_index(drop=True)

    df.to_csv('./tickers/processed_tickers.csv', index = False)

In [3]:
process_tickers()
tickers = pd.read_csv('./tickers/processed_tickers.csv')
tickers

Unnamed: 0,ticker
0,PIH
1,TURN
2,FLWS
3,FCCY
4,SRCE
...,...
6405,ZBK
6406,ZOES
6407,ZTS
6408,ZTO


# *Fetching S&P 500 Stock Histories*

---

Retrieving historical stock data using yfinance for S&P 500 companies to train ML models and compare performance.

*This process may take a few moments...*

In [1]:
from utility import download_ticker_data

def fetch_snp500_tickers():
    """
    Fetch the list of S&P 500 tickers from Wikipedia.

    Returns:
        list: A list of S&P 500 ticker symbols.
    """
    url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
    try:
        tickers = pd.read_html(url)[0]['Symbol'].str.replace('.', '-', regex=False).tolist()
        return tickers
    except Exception as e:
        raise RuntimeError(f"Failed to fetch S&P 500 tickers: {e}")

In [None]:
snp = fetch_snp500_tickers()
download_ticker_data(tickers = snp, period="max", data_dir="./snp_stocks/")
download_ticker_data(tickers = '^VIX', period="max", data_dir="./stocks/", force_redownload=True)
download_ticker_data(tickers = 'SPY', period="max", data_dir="./stocks/", force_redownload=True)

Data for MMM already exists. Skipping.
Data for AOS already exists. Skipping.
Data for ABT already exists. Skipping.
Data for ABBV already exists. Skipping.
Data for ACN already exists. Skipping.
Data for ADBE already exists. Skipping.
Data for AMD already exists. Skipping.
Data for AES already exists. Skipping.
Data for AFL already exists. Skipping.
Data for A already exists. Skipping.
Data for APD already exists. Skipping.
Data for ABNB already exists. Skipping.
Data for AKAM already exists. Skipping.
Data for ALB already exists. Skipping.
Data for ARE already exists. Skipping.
Data for ALGN already exists. Skipping.
Data for ALLE already exists. Skipping.
Data for LNT already exists. Skipping.
Data for ALL already exists. Skipping.
Data for GOOGL already exists. Skipping.
Data for GOOG already exists. Skipping.
Data for MO already exists. Skipping.
Data for AMZN already exists. Skipping.
Data for AMCR already exists. Skipping.
Data for AMTM already exists. Skipping.
Data for AEE alr

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, precision_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from utility import build_indicators

def compare_models(models, data_dir="./snp_stocks/", test_size = 0.4, start_date="2005-01-01", end_date="2024-11-23"):
    """
    Compare the performance of different models on historical stock data.

    Parameters:
        models (list): A list of sklearn model instances.
        data_dir (str): Directory containing stock data CSV files.
        start_date (str): Start date for historical data (used if downloading data is required).

    Returns:
        pd.DataFrame: DataFrame containing performance metrics for each model and stock.
    """

     # Initialize the scaler
    scaler = StandardScaler()

    # Ensure the data directory exists
    if not os.path.exists(data_dir):
        raise FileNotFoundError(f"The directory {data_dir} does not exist.")

    # Initialize results DataFrame
    results = []

    # Iterate through CSV files in the directory
    for file_name in os.listdir(data_dir):
        if not file_name.endswith(".csv"):
            continue

        stock_path = os.path.join(data_dir, file_name)
        stock_name = file_name.replace(".csv", "")

        try:
            # Load stock data
            
            df = pd.read_csv(stock_path, parse_dates=['Date'])
            df = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]
            print(stock_path)
            df = df.sort_values('Date')
            df.ffill(inplace = True)

            # Ensure 'Adj Close' column exists
            if 'Adj Close' not in df.columns:
                print(f"Skipping {stock_name}: 'Adj Close' column missing.")
                continue

            # Create feature and target columns
            df, indicators, flag = build_indicators(df, lookahead = 7, threshold = 0.5)

            X = df[indicators]
            y = df[flag]

            # Scale the feature data
            X_scaled = scaler.fit_transform(X)

            # Split data into train and test sets
            X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=test_size, random_state=42, shuffle=False)
            y_train = y_train.to_numpy()
            y_test = y_test.to_numpy()

            # Evaluate each model
            for model in models:
                model_name = type(model).__name__
                model.fit(X_train, y_train)

                # scores = cross_val_score(model, X_scaled, y, cv=5)

                # Predictions
                y_pred = model.predict(X_test)

                # Calculate metrics
                mse = mean_squared_error(y_test, y_pred)
                r2 = r2_score(y_test, y_pred)

                # Calculate metrics
                mse = mean_squared_error(y_test, y_pred)
                r2 = r2_score(y_test, y_pred)
                accuracy = accuracy_score(y_test, y_pred)
                precision = precision_score(y_test, y_pred)
                class_report = classification_report(y_test, y_pred, output_dict=True)
                # scores_mean = scores.mean()
                # scores_std = scores.std()

                # Append results
                results.append({
                    'Stock': stock_name,
                    'Model': model_name,
                    'MSE': mse,
                    'R2': r2,
                    'Accuracy': accuracy,
                    'Precision': precision,
                    '0_precision': class_report['0']['precision'],
                    '1_precision': class_report['1']['precision'],
                    '0_recall': class_report['0']['recall'],
                    '1_recall': class_report['1']['recall'],
                    'Classification_Report': class_report,
                    # 'Scores Mean': scores_mean,
                    # 'Scores std': scores_std
                })

        except Exception as e:
            print(f"Failed to process {stock_name}: {e}")

    # Convert results to DataFrame
    return pd.DataFrame(results)
            

In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression


# *Comparing Models*

---

Evaluating the performance of the following machine learning models:

- **MLPClassifier**
- **RandomForestClassifier**
- **HistGradientBoostingClassifier**
- **LogisticRegression**

Key metrics for comparison:
- **Accuracy**: Measure of correct predictions across all instances.
- **Precision**: Measure of true positive predictions among all positive predictions.

This analysis spans all 500 S&P 500 stocks, providing insights into the strengths and weaknesses of each model. The comparison can take around 15-20 minutes to complete.

In [None]:

clf1 = MLPClassifier(hidden_layer_sizes=(100, 100, 100), activation= 'tanh',max_iter=5000, random_state=42, learning_rate_init=0.005, warm_start=True)
clf2 = RandomForestClassifier(random_state=42, n_estimators=200)
clf3 = HistGradientBoostingClassifier(max_iter=1000, random_state=42, warm_start=True)
clf4 = LogisticRegression(solver='liblinear',penalty='l1', C=5.0, random_state=42, max_iter = 2000, warm_start=True)

models = [clf1, clf2, clf3, clf4]

res = compare_models(models, data_dir="./snp_stocks/")

./snp_stocks/A.csv
./snp_stocks/AAPL.csv
./snp_stocks/ABBV.csv
./snp_stocks/ABNB.csv
./snp_stocks/ABT.csv
./snp_stocks/ACGL.csv
./snp_stocks/ACN.csv
./snp_stocks/ADBE.csv
./snp_stocks/ADI.csv
./snp_stocks/ADM.csv
./snp_stocks/ADP.csv
./snp_stocks/ADSK.csv
./snp_stocks/AEE.csv
./snp_stocks/AEP.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/AES.csv
./snp_stocks/AFL.csv
./snp_stocks/AIG.csv
./snp_stocks/AIZ.csv
./snp_stocks/AJG.csv
./snp_stocks/AKAM.csv
./snp_stocks/ALB.csv
./snp_stocks/ALGN.csv
./snp_stocks/ALL.csv
./snp_stocks/ALLE.csv
./snp_stocks/AMAT.csv
./snp_stocks/AMCR.csv
Failed to process AMCR: Input contains NaN, infinity or a value too large for dtype('float64').
./snp_stocks/AMD.csv
./snp_stocks/AME.csv
./snp_stocks/AMGN.csv
./snp_stocks/AMP.csv
./snp_stocks/AMT.csv
./snp_stocks/AMTM.csv
Failed to process AMTM: Found array with 0 sample(s) (shape=(0, 16)) while a minimum of 1 is required by StandardScaler.
./snp_stocks/AMZN.csv
./snp_stocks/ANET.csv
./snp_stocks/ANSS.csv
./snp_stocks/AON.csv
./snp_stocks/AOS.csv
./snp_stocks/APA.csv
./snp_stocks/APD.csv
./snp_stocks/APH.csv
./snp_stocks/APTV.csv
./snp_stocks/ARE.csv
./snp_stocks/ATO.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/AVB.csv
./snp_stocks/AVGO.csv
./snp_stocks/AVY.csv
./snp_stocks/AWK.csv
./snp_stocks/AXON.csv
./snp_stocks/AXP.csv
./snp_stocks/AZO.csv
./snp_stocks/BA.csv
./snp_stocks/BAC.csv
./snp_stocks/BALL.csv
./snp_stocks/BAX.csv
./snp_stocks/BBY.csv
./snp_stocks/BDX.csv
./snp_stocks/BEN.csv
./snp_stocks/BF-B.csv
./snp_stocks/BG.csv
./snp_stocks/BIIB.csv
./snp_stocks/BK.csv
./snp_stocks/BKNG.csv
./snp_stocks/BKR.csv
./snp_stocks/BLDR.csv
./snp_stocks/BLK.csv
./snp_stocks/BMY.csv
./snp_stocks/BR.csv
./snp_stocks/BRK-B.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/BRO.csv
./snp_stocks/BSX.csv
./snp_stocks/BWA.csv
./snp_stocks/BX.csv
./snp_stocks/BXP.csv
./snp_stocks/C.csv
./snp_stocks/CAG.csv
./snp_stocks/CAH.csv
./snp_stocks/CARR.csv
./snp_stocks/CAT.csv
./snp_stocks/CB.csv
./snp_stocks/CBOE.csv
./snp_stocks/CBRE.csv
./snp_stocks/CCI.csv
./snp_stocks/CCL.csv
./snp_stocks/CDNS.csv
./snp_stocks/CDW.csv
./snp_stocks/CE.csv
./snp_stocks/CEG.csv
./snp_stocks/CF.csv
./snp_stocks/CFG.csv
./snp_stocks/CHD.csv
./snp_stocks/CHRW.csv
./snp_stocks/CHTR.csv
./snp_stocks/CI.csv
./snp_stocks/CINF.csv
./snp_stocks/CL.csv
./snp_stocks/CLX.csv
./snp_stocks/CMCSA.csv
./snp_stocks/CME.csv
./snp_stocks/CMG.csv
./snp_stocks/CMI.csv
./snp_stocks/CMS.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/CNC.csv
./snp_stocks/CNP.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/COF.csv
./snp_stocks/COO.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/COP.csv
./snp_stocks/COR.csv
./snp_stocks/COST.csv
./snp_stocks/CPAY.csv
./snp_stocks/CPB.csv
./snp_stocks/CPRT.csv
./snp_stocks/CPT.csv
./snp_stocks/CRL.csv
./snp_stocks/CRM.csv
./snp_stocks/CRWD.csv
./snp_stocks/CSCO.csv
./snp_stocks/CSGP.csv
./snp_stocks/CSX.csv
./snp_stocks/CTAS.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/CTLT.csv
./snp_stocks/CTRA.csv
./snp_stocks/CTSH.csv
./snp_stocks/CTVA.csv
./snp_stocks/CVS.csv
./snp_stocks/CVX.csv
./snp_stocks/CZR.csv
./snp_stocks/D.csv
./snp_stocks/DAL.csv
./snp_stocks/DAY.csv
./snp_stocks/DD.csv
./snp_stocks/DE.csv
./snp_stocks/DECK.csv
./snp_stocks/DELL.csv
./snp_stocks/DFS.csv
./snp_stocks/DG.csv
./snp_stocks/DGX.csv
./snp_stocks/DHI.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/DHR.csv
./snp_stocks/DIS.csv
./snp_stocks/DLR.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/DLTR.csv
./snp_stocks/DOC.csv
./snp_stocks/DOV.csv
./snp_stocks/DOW.csv
./snp_stocks/DPZ.csv
./snp_stocks/DRI.csv
./snp_stocks/DTE.csv
./snp_stocks/DUK.csv
./snp_stocks/DVA.csv
./snp_stocks/DVN.csv
./snp_stocks/DXCM.csv
./snp_stocks/EA.csv
./snp_stocks/EBAY.csv
./snp_stocks/ECL.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/ED.csv
./snp_stocks/EFX.csv
./snp_stocks/EG.csv
./snp_stocks/EIX.csv
./snp_stocks/EL.csv
./snp_stocks/ELV.csv
./snp_stocks/EMN.csv
./snp_stocks/EMR.csv
./snp_stocks/ENPH.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/EOG.csv
./snp_stocks/EPAM.csv
./snp_stocks/EQIX.csv
./snp_stocks/EQR.csv
./snp_stocks/EQT.csv
./snp_stocks/ERIE.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/ES.csv
./snp_stocks/ESS.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/ETN.csv
./snp_stocks/ETR.csv
./snp_stocks/EVRG.csv
./snp_stocks/EW.csv
./snp_stocks/EXC.csv
./snp_stocks/EXPD.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/EXPE.csv
./snp_stocks/EXR.csv
./snp_stocks/F.csv
./snp_stocks/FANG.csv
./snp_stocks/FAST.csv
./snp_stocks/FCX.csv
./snp_stocks/FDS.csv
./snp_stocks/FDX.csv
./snp_stocks/FE.csv
./snp_stocks/FFIV.csv
./snp_stocks/FI.csv
./snp_stocks/FICO.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/FIS.csv
./snp_stocks/FITB.csv
./snp_stocks/FMC.csv
./snp_stocks/FOX.csv
./snp_stocks/FOXA.csv
./snp_stocks/FRT.csv
./snp_stocks/FSLR.csv
./snp_stocks/FTNT.csv
./snp_stocks/FTV.csv
./snp_stocks/GD.csv
./snp_stocks/GDDY.csv
./snp_stocks/GE.csv
./snp_stocks/GEHC.csv
./snp_stocks/GEN.csv
./snp_stocks/GEV.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/GILD.csv
./snp_stocks/GIS.csv
./snp_stocks/GL.csv
./snp_stocks/GLW.csv
./snp_stocks/GM.csv
./snp_stocks/GNRC.csv
./snp_stocks/GOOG.csv
./snp_stocks/GOOGL.csv
./snp_stocks/GPC.csv
./snp_stocks/GPN.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/GRMN.csv
./snp_stocks/GS.csv
./snp_stocks/GWW.csv
./snp_stocks/HAL.csv
./snp_stocks/HAS.csv
./snp_stocks/HBAN.csv
./snp_stocks/HCA.csv
./snp_stocks/HD.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/HES.csv
./snp_stocks/HIG.csv
./snp_stocks/HII.csv
./snp_stocks/HLT.csv
./snp_stocks/HOLX.csv
./snp_stocks/HON.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/HPE.csv
./snp_stocks/HPQ.csv
./snp_stocks/HRL.csv
./snp_stocks/HSIC.csv
./snp_stocks/HST.csv
./snp_stocks/HSY.csv
./snp_stocks/HUBB.csv
./snp_stocks/HUM.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/HWM.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/IBM.csv
./snp_stocks/ICE.csv
./snp_stocks/IDXX.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/IEX.csv
./snp_stocks/IFF.csv
./snp_stocks/INCY.csv
./snp_stocks/INTC.csv
./snp_stocks/INTU.csv
./snp_stocks/INVH.csv
./snp_stocks/IP.csv
./snp_stocks/IPG.csv
./snp_stocks/IQV.csv
./snp_stocks/IR.csv
./snp_stocks/IRM.csv
./snp_stocks/ISRG.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/IT.csv
./snp_stocks/ITW.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/IVZ.csv
./snp_stocks/J.csv
./snp_stocks/JBHT.csv
./snp_stocks/JBL.csv
./snp_stocks/JCI.csv
./snp_stocks/JKHY.csv
./snp_stocks/JNJ.csv
./snp_stocks/JNPR.csv
./snp_stocks/JPM.csv
./snp_stocks/K.csv
./snp_stocks/KDP.csv
./snp_stocks/KEY.csv
./snp_stocks/KEYS.csv
./snp_stocks/KHC.csv
./snp_stocks/KIM.csv
./snp_stocks/KKR.csv
./snp_stocks/KLAC.csv
./snp_stocks/KMB.csv
./snp_stocks/KMI.csv
./snp_stocks/KMX.csv
./snp_stocks/KO.csv
./snp_stocks/KR.csv
./snp_stocks/KVUE.csv
./snp_stocks/L.csv
./snp_stocks/LDOS.csv
./snp_stocks/LEN.csv
./snp_stocks/LH.csv
./snp_stocks/LHX.csv
./snp_stocks/LIN.csv
./snp_stocks/LKQ.csv
./snp_stocks/LLY.csv
./snp_stocks/LMT.csv
./snp_stocks/LNT.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/LOW.csv
./snp_stocks/LRCX.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/LULU.csv
./snp_stocks/LUV.csv
./snp_stocks/LVS.csv
./snp_stocks/LW.csv
./snp_stocks/LYB.csv
./snp_stocks/LYV.csv
./snp_stocks/MA.csv
./snp_stocks/MAA.csv
./snp_stocks/MAR.csv
./snp_stocks/MAS.csv
./snp_stocks/MCD.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/MCHP.csv
./snp_stocks/MCK.csv
./snp_stocks/MCO.csv
./snp_stocks/MDLZ.csv
./snp_stocks/MDT.csv
./snp_stocks/MET.csv
./snp_stocks/META.csv
./snp_stocks/MGM.csv
./snp_stocks/MHK.csv
./snp_stocks/MKC.csv
./snp_stocks/MKTX.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/MLM.csv
./snp_stocks/MMC.csv
./snp_stocks/MMM.csv
./snp_stocks/MNST.csv
./snp_stocks/MO.csv
./snp_stocks/MOH.csv
./snp_stocks/MOS.csv
./snp_stocks/MPC.csv
./snp_stocks/MPWR.csv
./snp_stocks/MRK.csv
./snp_stocks/MRNA.csv
./snp_stocks/MS.csv
./snp_stocks/MSCI.csv
./snp_stocks/MSFT.csv
./snp_stocks/MSI.csv
./snp_stocks/MTB.csv
./snp_stocks/MTCH.csv
./snp_stocks/MTD.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/MU.csv
./snp_stocks/NCLH.csv
./snp_stocks/NDAQ.csv
./snp_stocks/NDSN.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/NEE.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/NEM.csv
./snp_stocks/NFLX.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/NI.csv
./snp_stocks/NKE.csv
./snp_stocks/NOC.csv
./snp_stocks/NOW.csv
./snp_stocks/NRG.csv
./snp_stocks/NSC.csv
./snp_stocks/NTAP.csv
./snp_stocks/NTRS.csv
./snp_stocks/NUE.csv
./snp_stocks/NVDA.csv
./snp_stocks/NVR.csv
./snp_stocks/NWS.csv
./snp_stocks/NWSA.csv
./snp_stocks/NXPI.csv
./snp_stocks/O.csv
./snp_stocks/ODFL.csv
./snp_stocks/OKE.csv
./snp_stocks/OMC.csv
./snp_stocks/ON.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/ORCL.csv
./snp_stocks/ORLY.csv
./snp_stocks/OTIS.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/OXY.csv
./snp_stocks/PANW.csv
./snp_stocks/PARA.csv
./snp_stocks/PAYC.csv
./snp_stocks/PAYX.csv
./snp_stocks/PCAR.csv
./snp_stocks/PCG.csv
./snp_stocks/PEG.csv
./snp_stocks/PEP.csv
./snp_stocks/PFE.csv
./snp_stocks/PFG.csv
./snp_stocks/PG.csv
./snp_stocks/PGR.csv
./snp_stocks/PH.csv
./snp_stocks/PHM.csv
./snp_stocks/PKG.csv
./snp_stocks/PLD.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/PLTR.csv
./snp_stocks/PM.csv
./snp_stocks/PNC.csv
./snp_stocks/PNR.csv
./snp_stocks/PNW.csv
./snp_stocks/PODD.csv
./snp_stocks/POOL.csv
./snp_stocks/PPG.csv
./snp_stocks/PPL.csv
./snp_stocks/PRU.csv
./snp_stocks/PSA.csv
./snp_stocks/PSX.csv
./snp_stocks/PTC.csv
./snp_stocks/PWR.csv
./snp_stocks/PYPL.csv
./snp_stocks/QCOM.csv
./snp_stocks/QRVO.csv
./snp_stocks/RCL.csv
./snp_stocks/REG.csv
./snp_stocks/REGN.csv
./snp_stocks/RF.csv
./snp_stocks/RJF.csv
./snp_stocks/RL.csv
./snp_stocks/RMD.csv
./snp_stocks/ROK.csv
./snp_stocks/ROL.csv
./snp_stocks/ROP.csv
./snp_stocks/ROST.csv
./snp_stocks/RSG.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/RTX.csv
./snp_stocks/RVTY.csv
./snp_stocks/SBAC.csv
./snp_stocks/SBUX.csv
./snp_stocks/SCHW.csv
./snp_stocks/SHW.csv
./snp_stocks/SJM.csv
./snp_stocks/SLB.csv
./snp_stocks/SMCI.csv
./snp_stocks/SNA.csv
./snp_stocks/SNPS.csv
./snp_stocks/SO.csv
./snp_stocks/SOLV.csv
Failed to process SOLV: warm_start can only be used where `y` has the same classes as in the previous call to fit. Previously got [0 1], `y` has [0]
./snp_stocks/SPG.csv
./snp_stocks/SPGI.csv
./snp_stocks/SRE.csv
./snp_stocks/STE.csv
./snp_stocks/STLD.csv
./snp_stocks/STT.csv
./snp_stocks/STX.csv
./snp_stocks/STZ.csv
./snp_stocks/SW.csv
Failed to process SW: Found array with 0 sample(s) (shape=(0, 16)) while a minimum of 1 is required by StandardScaler.
./snp_stocks/SWK.csv
./snp_stocks/SWKS.csv
./snp_stocks/SYF.csv
./snp_stocks/SYK.csv
./snp_stocks/SYY.csv
./snp_stocks/T.csv
./snp_stocks/TAP.csv
./snp_stocks/TDG.csv
./snp_stocks/TDY.csv
./snp_stocks/TECH.csv
./snp_stocks/TEL.csv
./snp_stocks/TER.csv
./snp_stock

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/WELL.csv
./snp_stocks/WFC.csv
./snp_stocks/WM.csv
./snp_stocks/WMB.csv
./snp_stocks/WMT.csv
./snp_stocks/WRB.csv
./snp_stocks/WST.csv
./snp_stocks/WTW.csv


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


./snp_stocks/WY.csv
./snp_stocks/WYNN.csv
./snp_stocks/XEL.csv
./snp_stocks/XOM.csv
./snp_stocks/XYL.csv
./snp_stocks/YUM.csv
./snp_stocks/ZBH.csv
./snp_stocks/ZBRA.csv
./snp_stocks/ZTS.csv


# *Processing Results and Selecting Top Performers*

---

## Objectives:
1. **Identify Top Models**:
   - Analyze the performance metrics to select the top 2 models for parameter tuning.

2. **Highlight Best Stocks**:
   - Rank the S&P 500 stocks by model performance.
   - Select the top 10 stocks for a detailed performance report.

## Next Steps:
- Perform **hyperparameter tuning** on the top 2 models to optimize their accuracy and precision.
- Generate **detailed analytics** for the top 10 stocks to uncover patterns and insights for future trading strategies.

In [179]:
# filter the results by rubric requirements
res_rubric = res[(res['Accuracy'] >= 0.5014) & (res['Precision'] >= 0.5141)]
res_rubric

Unnamed: 0,Stock,Model,MSE,R2,Accuracy,Precision,0_precision,1_precision,0_recall,1_recall,Classification_Report
0,A,MLPClassifier,0.493560,-0.980486,0.506440,0.535005,0.478659,0.535005,0.514192,0.499512,"{'0': {'precision': 0.47865853658536583, 'reca..."
2,A,HistGradientBoostingClassifier,0.476558,-0.912264,0.523442,0.551020,0.495317,0.551020,0.519651,0.526829,"{'0': {'precision': 0.4953173777315297, 'recal..."
4,AAPL,MLPClassifier,0.485832,-0.960375,0.514168,0.574873,0.472680,0.574873,0.619318,0.426956,"{'0': {'precision': 0.47267996530789247, 'reca..."
5,AAPL,RandomForestClassifier,0.455951,-0.839801,0.544049,0.545502,0.142857,0.545502,0.001136,0.994345,"{'0': {'precision': 0.14285714285714285, 'reca..."
6,AAPL,HistGradientBoostingClassifier,0.488923,-0.972848,0.511077,0.556795,0.463874,0.556795,0.503409,0.517436,"{'0': {'precision': 0.4638743455497382, 'recal..."
...,...,...,...,...,...,...,...,...,...,...,...
1985,ZBH,RandomForestClassifier,0.463163,-0.855885,0.536837,0.557621,0.533493,0.557621,0.882295,0.161290,"{'0': {'precision': 0.5334928229665071, 'recal..."
1986,ZBH,HistGradientBoostingClassifier,0.442555,-0.773310,0.557445,0.533842,0.585202,0.533842,0.516320,0.602151,"{'0': {'precision': 0.5852017937219731, 'recal..."
1987,ZBH,LogisticRegression,0.458527,-0.837306,0.541473,0.666667,0.533224,0.666667,0.960435,0.086022,"{'0': {'precision': 0.5332235035694673, 'recal..."
1990,ZBRA,HistGradientBoostingClassifier,0.495621,-0.983368,0.504379,0.515057,0.493865,0.515057,0.508421,0.500505,"{'0': {'precision': 0.4938650306748466, 'recal..."


In [194]:
# sort the results by accuracy and fetch the top 50 results
res_sorted = res_rubric.sort_values(by='Accuracy', ascending=False)
res_sorted.iloc[:20]

Unnamed: 0,Stock,Model,MSE,R2,Accuracy,Precision,0_precision,1_precision,0_recall,1_recall,Classification_Report
806,GEV,HistGradientBoostingClassifier,0.25,0.0,0.75,1.0,0.666667,1.0,1.0,0.5,"{'0': {'precision': 0.6666666666666666, 'recal..."
1366,NTRS,HistGradientBoostingClassifier,0.407522,-0.630244,0.592478,0.576993,0.612903,0.576993,0.523469,0.662851,"{'0': {'precision': 0.6129032258064516, 'recal..."
567,DOW,LogisticRegression,0.410959,-0.676195,0.589041,0.565789,0.593103,0.565789,0.886598,0.195455,"{'0': {'precision': 0.593103448275862, 'recall..."
1903,VZ,LogisticRegression,0.415765,-0.690434,0.584235,0.55988,0.589297,0.55988,0.865631,0.220779,"{'0': {'precision': 0.5892968263845675, 'recal..."
941,IBM,RandomForestClassifier,0.418341,-0.673688,0.581659,0.59283,0.574138,0.59283,0.676829,0.483804,"{'0': {'precision': 0.5741379310344827, 'recal..."
972,INVH,MLPClassifier,0.423448,-0.705085,0.576552,0.606557,0.570481,0.606557,0.877551,0.222222,"{'0': {'precision': 0.5704809286898839, 'recal..."
917,HSIC,RandomForestClassifier,0.424523,-0.712333,0.575477,0.561702,0.579878,0.561702,0.805477,0.29932,"{'0': {'precision': 0.5798776342624066, 'recal..."
1901,VZ,RandomForestClassifier,0.425039,-0.728139,0.574961,0.531429,0.584538,0.531429,0.850091,0.219599,"{'0': {'precision': 0.5845380263984915, 'recal..."
987,IQV,LogisticRegression,0.425068,-0.70051,0.574932,0.581197,0.5703,0.581197,0.648115,0.5,"{'0': {'precision': 0.5703001579778831, 'recal..."
1838,UNH,HistGradientBoostingClassifier,0.426069,-0.705363,0.573931,0.584848,0.562566,0.584848,0.565539,0.58191,"{'0': {'precision': 0.562565720294427, 'recall..."


In [182]:

def find_best_models(df):
    counter = {}
    for i, row in df.iterrows():
        if row['Model'] in counter:
            counter[row['Model']].add(row['Stock'])
        else:
            counter[row['Model']] = set()
            counter[row['Model']].add(row['Stock'])
        for model1, s1 in counter.items():
            for model2, s2 in counter.items():
                if model1 == model2:
                    continue
                if len(s1 & s2) >= 10:
                    return (model1, model2, s1 & s2)

In [185]:
# Find the top 2 models and top 10 stocks that met the rubric requirements
m1, m2, stocks = find_best_models(res_sorted)
print('Selected Models: ', m1, m2)
print('Selected Stocks: ', stocks)

Selected Models:  LogisticRegression MLPClassifier
Selected Stocks:  {'MU', 'UPS', 'IQV', 'HST', 'PPG', 'TMUS', 'FE', 'MOS', 'CL', 'FTV'}


In [188]:
for stock in stocks:
    df = pd.read_csv(f'./snp_stocks/{stock}.csv')
    print(f'Data start date of {stock}: ', df['Date'][0])

Data start date of MU:  1984-06-01
Data start date of UPS:  1999-11-10
Data start date of IQV:  2013-05-09
Data start date of HST:  1980-03-17
Data start date of PPG:  1980-03-17
Data start date of TMUS:  2007-04-19
Data start date of FE:  1997-11-10
Data start date of MOS:  1988-01-26
Data start date of CL:  1973-05-02
Data start date of FTV:  2016-07-05
