# NLP Feature Generation Using PrimoGPT Model

This notebook implements the feature generation process using our custom-trained PrimoGPT model. Unlike the training data preparation script, this notebook:
1. Uses the trained PrimoGPT model from HuggingFace (`custom_gpt=True`)
2. Generates features for trading, not training (`is_for_train=False`)
3. Creates production-ready features for the trading environment

## Key Parameters
- `is_for_train=False`: Disables training-specific prompt templates
- `custom_gpt=True`: Uses our fine-tuned PrimoGPT model instead of GPT-4o
- `model_id="One2Many/PrimoGPT-Instruct"`: HuggingFace model path

## Feature Generation Process
1. Downloads historical stock prices, news, and press releases
2. Processes each trading day through PrimoGPT to generate features:
   - News relevance (0-2)
   - Sentiment (-1 to 1)
   - Price impact potential (-3 to 3)
   - Trend direction (-1 to 1)
   - Earnings impact (-2 to 2)
   - Investor confidence (-3 to 3)
   - Risk profile change (-2 to 2)
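
Because each feature has a fixed range, a small check can flag model outputs that fall outside it before they reach the trading environment. The sketch below is illustrative and not part of the `primogpt` package: the helper name `out_of_range_features` is hypothetical, the column names follow the generated CSV header shown later in this notebook, and it assumes the features are integers.

```python
# Allowed range for each generated feature, taken from the list above.
FEATURE_RANGES = {
    "News Relevance": (0, 2),
    "Sentiment": (-1, 1),
    "Price Impact Potential": (-3, 3),
    "Trend Direction": (-1, 1),
    "Earnings Impact": (-2, 2),
    "Investor Confidence": (-3, 3),
    "Risk Profile Change": (-2, 2),
}


def out_of_range_features(row):
    """Return the names of features in `row` that are missing or outside their range.

    Hypothetical helper for illustration; `row` is a dict mapping column name to value.
    """
    bad = []
    for name, (low, high) in FEATURE_RANGES.items():
        value = row.get(name)
        if value is None or not (low <= value <= high):
            bad.append(name)
    return bad


# Feature values from the 2021-09-16 row of the generated output in this notebook.
sample = {
    "News Relevance": 2,
    "Sentiment": 1,
    "Price Impact Potential": 2,
    "Trend Direction": 1,
    "Earnings Impact": 1,
    "Investor Confidence": 2,
    "Risk Profile Change": 0,
}
print(out_of_range_features(sample))  # prints []
```

A non-empty result would suggest re-running or discarding that day's generation rather than silently clipping the value.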

## Output Format
Generated features are saved in CSV format with columns:
- Date
- Stock price data (`Adj Close Price`, `Returns`, `Bin Label`)
- Generated NLP features (one column per feature listed above)
- Raw news and press releases (for reference)

### The cell below installs the required packages on Google Colab

In [1]:
#!pip install langchain
#!pip install langchain_openai
#!pip install python-dotenv
#!pip install huggingface_hub
#!pip install transformers
#!pip install peft
#!pip install yfinance
#!pip install finnhub-python
#!pip install datasets
#!pip install openai

#!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

#from torch import __version__; from packaging.version import Version as V
#xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
#!pip install --no-deps {xformers} trl peft accelerate bitsandbytes triton

In [2]:
# Import required modules and set up paths
import sys
sys.path.append('../../')

import json
import os

import pandas as pd  # imported explicitly rather than relying on the star imports below

from primogpt.create_prompt import *
from primogpt.prepare_data import *

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [3]:
# Define stock symbol and date range
stock_symbol = "CRM"
start_date = "2021-09-01"
end_date = "2024-07-31"

# Create directory for storing generated features
data_dir = f"/data/{stock_symbol}_{start_date}_{end_date}"
os.makedirs(data_dir, exist_ok=True)

In [4]:
# Download and prepare raw data
# This includes stock prices, news, and press releases
prepare_data_for_symbol(stock_symbol, data_dir, start_date, end_date)

[*********************100%***********************]  1 of 1 completed


News done
Press releases done


Date (index),Date,Adj Close Price,Returns,Bin Label,News,PressReleases
2021-09-02,2021-09-02,263.395386,-0.015541,D2,[],"[{""date"": ""2021-09-02 09:00:00"", ""headline"": ""..."
2021-09-03,2021-09-03,266.317017,0.011092,U2,[],[]
2021-09-07,2021-09-07,264.452362,-0.007002,D1,[],[]
2021-09-08,2021-09-08,261.869781,-0.009766,D1,[],"[{""date"": ""2021-09-08 08:00:00"", ""headline"": ""..."
2021-09-09,2021-09-09,259.995117,-0.007159,D1,[],[]
...,...,...,...,...,...,...
2024-07-24,2024-07-24,249.779999,-0.024525,D3,"[{""date"": ""20240724160008"", ""headline"": ""The h...",[]
2024-07-25,2024-07-25,256.519989,0.026984,U3,"[{""date"": ""20240725180038"", ""headline"": ""The n...",[]
2024-07-26,2024-07-26,262.709991,0.024131,U3,"[{""date"": ""20240728074500"", ""headline"": ""Predi...",[]
2024-07-29,2024-07-29,258.589996,-0.015683,D2,"[{""date"": ""20240729214520"", ""headline"": ""Sales...",[]
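
The `Bin Label` column is not explained in this notebook. The rows shown above are consistent with a direction letter (`U` for up, `D` for down) followed by the absolute return rounded up to the next whole percent. The sketch below encodes that reading as an assumption only: the function name `bin_label` and the 1% bin width are inferred from the sample rows, not taken from `prepare_data`, and any cap on the largest bin is not modeled.

```python
import math


def bin_label(daily_return, bin_width=0.01):
    """Map a daily return to a label such as 'U2' or 'D1'.

    Assumption: bins are `bin_width` wide (1% by default) and a zero return
    counts as 'U1'. This is inferred from the displayed rows, not from the source code.
    """
    direction = "U" if daily_return >= 0 else "D"
    bucket = max(1, math.ceil(abs(daily_return) / bin_width))
    return f"{direction}{bucket}"


# Returns and labels copied from the table above.
for value, expected in [(-0.015541, "D2"), (0.011092, "U2"), (-0.024525, "D3")]:
    print(value, bin_label(value), expected)
```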


In [5]:
# Load the raw data file
csv_file_name = f"{stock_symbol}_{start_date}_{end_date}.csv"
csv_file_path = os.path.join(data_dir, csv_file_name)

# Display the first 50 rows of the data
df = pd.read_csv(csv_file_path)
df.head(50)

Unnamed: 0,Date,Adj Close Price,Returns,Bin Label,News,PressReleases
0,2021-09-02,263.395386,-0.015541,D2,[],"[{""date"": ""2021-09-02 09:00:00"", ""headline"": ""..."
1,2021-09-03,266.317017,0.011092,U2,[],[]
2,2021-09-07,264.452362,-0.007002,D1,[],[]
3,2021-09-08,261.869781,-0.009766,D1,[],"[{""date"": ""2021-09-08 08:00:00"", ""headline"": ""..."
4,2021-09-09,259.995117,-0.007159,D1,[],[]
5,2021-09-10,256.46524,-0.013577,D2,"[{""date"": ""20210912015212"", ""headline"": ""Outpe...",[]
6,2021-09-13,253.384079,-0.012014,D2,"[{""date"": ""20210913164600"", ""headline"": ""Sales...",[]
7,2021-09-14,253.643326,0.001023,U1,"[{""date"": ""20210914164500"", ""headline"": ""Sales...",[]
8,2021-09-15,255.428223,0.007037,U1,"[{""date"": ""20210915164600"", ""headline"": ""Sales...",[]
9,2021-09-16,259.616211,0.016396,U2,"[{""date"": ""20210916164600"", ""headline"": ""Sales...","[{""date"": ""2021-09-16 16:30:00"", ""headline"": ""..."


In [6]:
# Print the news items for one row to verify content
# (row 1 has an empty news list, so nothing is printed after the header)
news_content = df.loc[1, 'News']
news_content_json = json.loads(news_content)

print("News:")
for news_item in news_content_json:
    print(f"Date: {news_item['date']}, Headline: {news_item['headline']}, Summary: {news_item['summary']}\n")

News:


In [7]:
# Print the press releases for one row to verify content
# (row 4 has an empty press release list, so nothing is printed after the header)
press_releases = df.loc[4, 'PressReleases']
press_releases_json = json.loads(press_releases)

print("Press Releases:")
for release in press_releases_json:
    print(f"Date: {release['date']}, Headline: {release['headline']}, Description: {release['description']}\n")

Press Releases:
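
Both cells above decode a JSON string stored in a CSV cell and then index into each item by key. A slightly more defensive variant is sketched below; the helper names `parse_items` and `format_item` are hypothetical, and the sample item is a made-up placeholder rather than real data. Only the keys (`date`, `headline`, `summary`, `description`) come from the cells above.

```python
import json


def parse_items(cell):
    """Decode a JSON-encoded list stored in a CSV cell; return [] for empty or missing cells."""
    if not isinstance(cell, str) or not cell.strip():
        return []  # e.g. NaN from pandas or an empty string
    items = json.loads(cell)
    return items if isinstance(items, list) else []


def format_item(item, body_key):
    """Format one news item (body_key='summary') or press release (body_key='description')."""
    return (
        f"Date: {item.get('date', '?')}, "
        f"Headline: {item.get('headline', '?')}, "
        f"{body_key.capitalize()}: {item.get(body_key, '')}"
    )


# Placeholder item for illustration only.
cell = '[{"date": "2021-09-10 09:00:00", "headline": "Example headline", "summary": "Example summary"}]'
for item in parse_items(cell):
    print(format_item(item, "summary"))

print(parse_items("[]"))  # prints []
```

Using `dict.get` avoids a `KeyError` when an item lacks one of the expected fields.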


In [8]:
# Log in to HuggingFace (SECRET_TOKEN is a placeholder for your access token)
from huggingface_hub import login
login(token="SECRET_TOKEN")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful
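
To keep the access token out of the notebook, one option is to read it from an environment variable before calling `login`. This is a minimal sketch, not what the notebook does: the helper `get_hf_token` is hypothetical, and the variable name `HF_TOKEN` is an assumed choice that can be changed.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Return the HuggingFace token stored in `env_var`, or raise if it is unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to a HuggingFace access token"
        )
    return token


# Usage in the notebook (requires huggingface_hub, installed in the first cell):
# from huggingface_hub import login
# login(token=get_hf_token())
```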


In [9]:
# Generate features using PrimoGPT
# is_for_train=False: Use production prompt template
# custom_gpt=True: Use our fine-tuned model
results = process_stock_data(stock_symbol, data_dir, start_date, end_date, is_for_train=False, custom_gpt=True)

==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA L4. Max memory: 22.168 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
100%|██████████| 729/729 [50:45<00:00,  4.18s/it]


In [10]:
# Load the generated features from the results CSV
csv_file_name = f"{stock_symbol}_{start_date}_{end_date}_gpt.csv"
csv_file_path = os.path.join(data_dir, csv_file_name)

# Display the first 50 rows of the results
df = pd.read_csv(csv_file_path)
df.head(50)

Unnamed: 0,Date,Adj Close Price,Returns,Bin Label,News Relevance,Sentiment,Price Impact Potential,Trend Direction,Earnings Impact,Investor Confidence,Risk Profile Change,Prompt
0,2021-09-02,263.395386,-0.015541,D2,1,0,-1,0,0,-1,0,\n [COMPANY BASICS]\n ...
1,2021-09-03,266.317017,0.011092,U2,0,0,0,0,0,0,0,\n [COMPANY BASICS]\n ...
2,2021-09-07,264.452362,-0.007002,D1,0,0,0,0,0,0,0,\n [COMPANY BASICS]\n ...
3,2021-09-08,261.869781,-0.009766,D1,2,1,1,1,1,2,0,\n [COMPANY BASICS]\n ...
4,2021-09-09,259.995117,-0.007159,D1,0,0,0,0,0,0,0,\n [COMPANY BASICS]\n ...
5,2021-09-10,256.46524,-0.013577,D2,0,0,0,0,0,0,0,\n [COMPANY BASICS]\n ...
6,2021-09-13,253.384079,-0.012014,D2,1,0,-1,0,0,-1,0,\n [COMPANY BASICS]\n ...
7,2021-09-14,253.643326,0.001023,U1,2,1,1,1,1,2,0,\n [COMPANY BASICS]\n ...
8,2021-09-15,255.428223,0.007037,U1,1,0,0,0,0,0,0,\n [COMPANY BASICS]\n ...
9,2021-09-16,259.616211,0.016396,U2,2,1,2,1,1,2,0,\n [COMPANY BASICS]\n ...
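
For downstream use, the seven feature columns can be pulled out as integer vectors keyed by date. The sketch below uses only the standard library and is an illustration rather than the project's loader: the function name `load_feature_rows` is hypothetical, it assumes the features are stored as integer text, and the inline sample reuses two rows from the output above with the `Prompt` value left elided. The first header field is empty, as in a CSV written with an unnamed index, which pandas displays as `Unnamed: 0`.

```python
import csv
import io

# Feature column names as they appear in the generated CSV header above.
FEATURE_COLUMNS = [
    "News Relevance",
    "Sentiment",
    "Price Impact Potential",
    "Trend Direction",
    "Earnings Impact",
    "Investor Confidence",
    "Risk Profile Change",
]


def load_feature_rows(handle):
    """Yield (date, [feature ints]) pairs from a features CSV opened in text mode."""
    for record in csv.DictReader(handle):
        yield record["Date"], [int(record[name]) for name in FEATURE_COLUMNS]


# Two rows copied from the output above; the Prompt text is elided as "...".
sample_csv = io.StringIO(
    ",Date,Adj Close Price,Returns,Bin Label,News Relevance,Sentiment,"
    "Price Impact Potential,Trend Direction,Earnings Impact,"
    "Investor Confidence,Risk Profile Change,Prompt\n"
    "0,2021-09-02,263.395386,-0.015541,D2,1,0,-1,0,0,-1,0,...\n"
    "3,2021-09-08,261.869781,-0.009766,D1,2,1,1,1,1,2,0,...\n"
)
rows = list(load_feature_rows(sample_csv))
print(rows[0])  # prints ('2021-09-02', [1, 0, -1, 0, 0, -1, 0])
```

With the real file, the same function can be called on `open(csv_file_path, newline="")` in place of the in-memory sample.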
