# NLP Feature Generation Using PrimoGPT Model

This notebook implements the feature generation process using our custom-trained PrimoGPT model. Unlike the training data preparation script, this notebook:
1. Uses the trained PrimoGPT model from HuggingFace (`custom_gpt=True`)
2. Generates features for trading, not training (`is_for_train=False`)
3. Creates production-ready features for the trading environment

## Key Parameters
- `is_for_train=False`: Disables training-specific prompt templates
- `custom_gpt=True`: Uses our fine-tuned PrimoGPT model instead of GPT-4o
- `model_id="One2Many/PrimoGPT-Instruct"`: HuggingFace model path

## Feature Generation Process
Each trading day is processed through PrimoGPT, which scores that day's news on seven features:
- News relevance (0 to 2)
- Sentiment (-1 to 1)
- Price impact potential (-3 to 3)
- Trend direction (-1 to 1)
- Earnings impact (-2 to 2)
- Investor confidence (-3 to 3)
- Risk profile change (-2 to 2)
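
The ranges above lend themselves to a simple sanity check on model output before it reaches the trading environment. The following is a minimal sketch, assuming the features arrive as a dict keyed by the column names used in the output CSV. `FEATURE_RANGES` and `clamp_features` are hypothetical names for illustration and are not part of the `primogpt` package.

```python
# Hypothetical helper, not part of the primogpt package.
# The ranges are copied from the feature list above.
FEATURE_RANGES = {
    "News Relevance": (0, 2),
    "Sentiment": (-1, 1),
    "Price Impact Potential": (-3, 3),
    "Trend Direction": (-1, 1),
    "Earnings Impact": (-2, 2),
    "Investor Confidence": (-3, 3),
    "Risk Profile Change": (-2, 2),
}


def clamp_features(features):
    """Return a copy of `features` with every known feature forced into its range.

    Missing features default to 0 (an assumption made for this sketch).
    """
    cleaned = {}
    for name, (low, high) in FEATURE_RANGES.items():
        value = int(features.get(name, 0))
        cleaned[name] = max(low, min(high, value))
    return cleaned


# Out-of-range values are pulled back to the nearest bound.
example = clamp_features(
    {"News Relevance": 5, "Sentiment": -1, "Investor Confidence": -4}
)
print(example)
```

Clamping silently repairs bad values. Raising an error instead may be preferable when out-of-range output should be investigated rather than corrected.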

## Output Format
Generated features are saved in CSV format with columns:
- Date
- Stock price data
- Generated NLP features
- Raw news and press releases (for reference)
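
Before feeding the file to downstream code, it can help to confirm that the expected columns are present. The sketch below assumes the column names that appear in the generated `*_gpt.csv` preview later in this notebook. `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the list deliberately leaves out the raw text columns.

```python
# Column names as they appear in the generated *_gpt.csv preview in this notebook.
# Hypothetical check, not part of the primogpt package.
EXPECTED_COLUMNS = [
    "Date", "Adj Close Price", "Returns", "Bin Label",
    "News Relevance", "Sentiment", "Price Impact Potential", "Trend Direction",
    "Earnings Impact", "Investor Confidence", "Risk Profile Change",
]


def missing_columns(columns):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


# A deliberately incomplete header to show what the check reports.
header = ["Unnamed: 0", "Date", "Adj Close Price", "Returns", "Bin Label", "Sentiment"]
print(missing_columns(header))
```

With pandas, the same check can be run as `missing_columns(df.columns)` on the loaded DataFrame.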

### The cell below installs the required packages on Google Colab

In [1]:
!pip install langchain
!pip install langchain_openai
!pip install python-dotenv
!pip install huggingface_hub
!pip install transformers
!pip install peft
!pip install yfinance
!pip install finnhub-python
!pip install datasets
!pip install openai

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

from torch import __version__; from packaging.version import Version as V
xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
!pip install --no-deps {xformers} trl peft accelerate bitsandbytes triton

Collecting langchain_openai
  Downloading langchain_openai-0.3.10-py3-none-any.whl.metadata (2.3 kB)
Collecting langchain-core<1.0.0,>=0.3.48 (from langchain_openai)
  Downloading langchain_core-0.3.48-py3-none-any.whl.metadata (5.9 kB)
Collecting openai<2.0.0,>=1.68.2 (from langchain_openai)
  Downloading openai-1.68.2-py3-none-any.whl.metadata (25 kB)
Collecting tiktoken<1,>=0.7 (from langchain_openai)
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading langchain_openai-0.3.10-py3-none-any.whl (61 kB)
Downloading langchain_core-0.3.48-py3-none-any.whl (418 kB)
Downloading openai-1.68.2-py3-none-any.whl (606 kB)

Collecting xformers
  Downloading xformers-0.0.29.post3-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.29.post3-cp311-cp311-manylinux_2_28_x86_64.whl (43.4 MB)
Installing collected packages: xformers
Successfully installed xformers-0.0.29.post3


In [2]:
from google.colab import drive
import pandas as pd
import sys
import os

drive.mount('/content/drive')
sys.path.insert(0, '/content/drive/MyDrive/Colab')

from primogpt.create_prompt import *
from primogpt.prepare_data import *

Mounted at /content/drive
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [3]:
# Define stock symbol and date range
stock_symbol = "NFLX"
start_date = "2022-04-01"
end_date = "2025-02-28"

In [4]:
data_dir = '/content/drive/MyDrive/Colab'
# Build the file name from the variables defined above instead of hard-coding it
csv_file_path = os.path.join(data_dir, f"{stock_symbol}_{start_date}_{end_date}.csv")

df = pd.read_csv(csv_file_path)
df.head(30)

Unnamed: 0,Date,Adj Close Price,Returns,Bin Label,News,PressReleases
0,2022-04-04,391.5,0.048277,U5,[],[]
1,2022-04-05,380.149994,-0.028991,D3,[],[]
2,2022-04-06,368.350006,-0.03104,D4,[],[]
3,2022-04-07,362.149994,-0.016832,D2,[],[]
4,2022-04-08,355.880005,-0.017313,D2,"[{""date"": ""20220410084408"", ""headline"": ""Tradi...",[]
5,2022-04-11,348.0,-0.022142,D3,"[{""date"": ""20220411174500"", ""headline"": ""3 Top...",[]
6,2022-04-12,344.100006,-0.011207,D2,"[{""date"": ""20220412162300"", ""headline"": ""All E...",[]
7,2022-04-13,350.429993,0.018396,U2,"[{""date"": ""20220413170735"", ""headline"": ""Netfl...",[]
8,2022-04-14,341.130005,-0.026539,D3,"[{""date"": ""20220414164602"", ""headline"": ""Netfl...",[]
9,2022-04-18,337.859985,-0.009586,D1,"[{""date"": ""20220418160028"", ""headline"": ""5 Rea...",[]
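
The `News` column in the preview above appears to hold a JSON list stored as text, with `date` and `headline` keys and timestamps in `YYYYMMDDHHMMSS` form. The sketch below parses one such cell, assuming that layout holds for every row. `parse_news_cell` is a hypothetical helper and the sample headline is made up.

```python
import json
from datetime import datetime


def parse_news_cell(cell):
    """Turn one `News` cell (a JSON list stored as text) into (timestamp, headline) pairs."""
    items = json.loads(cell) if cell else []
    parsed = []
    for item in items:
        # Timestamps in the sample rows look like "20220410084408" (YYYYMMDDHHMMSS).
        timestamp = datetime.strptime(item["date"], "%Y%m%d%H%M%S")
        parsed.append((timestamp, item.get("headline", "")))
    return parsed


# Hypothetical cell: the headline text is invented, the layout follows the sample rows.
sample_cell = '[{"date": "20220410084408", "headline": "Example headline"}]'
print(parse_news_cell(sample_cell))
print(parse_news_cell("[]"))  # days without news yield an empty list
```

With the DataFrame loaded above, something like `df["News"].apply(parse_news_cell)` would apply the parser to every row, provided each cell is valid JSON.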


In [5]:
# Login to HuggingFace
from huggingface_hub import login
login(token="SECRET_TOKEN")
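
Hard-coding a token in a shared notebook risks leaking it. The following is a minimal sketch, assuming the token has been exported as an environment variable named `HF_TOKEN`. The variable name and the `get_hf_token` helper are assumptions for illustration and are not part of this project.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Read the HuggingFace token from the environment, failing loudly if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before logging in.")
    return token


# Usage with the login() function imported in the cell above:
# login(token=get_hf_token())
```

On Colab, the token could be placed in the environment by whatever secret-handling mechanism is already in use. The helper only assumes the value is visible through `os.environ`.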

In [6]:
# Generate features using PrimoGPT
# is_for_train=False: Use production prompt template
# custom_gpt=True: Use our fine-tuned model
# The generated data is saved to CSV inside process_stock_data
results = process_stock_data(stock_symbol, data_dir, start_date, end_date, is_for_train=False, custom_gpt=True)

==((====))==  Unsloth 2025.3.18: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.5k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Unsloth 2025.3.18 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
100%|██████████| 727/727 [44:07<00:00,  3.64s/it]


In [7]:
# Load the generated feature file and display its first 50 rows
csv_file_name = f"{stock_symbol}_{start_date}_{end_date}_gpt.csv"
csv_file_path = os.path.join(data_dir, csv_file_name)

df = pd.read_csv(csv_file_path)
df.head(50)

Unnamed: 0,Date,Adj Close Price,Returns,Bin Label,News Relevance,Sentiment,Price Impact Potential,Trend Direction,Earnings Impact,Investor Confidence,Risk Profile Change,Prompt
0,2022-04-04,391.5,0.048277,U5,0,0,0,0,0,0,0,\n [COMPANY BASICS]\n ...
1,2022-04-05,380.149994,-0.028991,D3,0,0,0,0,0,0,0,\n [COMPANY BASICS]\n ...
2,2022-04-06,368.350006,-0.03104,D4,0,0,0,0,0,0,0,\n [COMPANY BASICS]\n ...
3,2022-04-07,362.149994,-0.016832,D2,0,0,0,0,0,0,0,\n [COMPANY BASICS]\n ...
4,2022-04-08,355.880005,-0.017313,D2,0,0,0,0,0,0,0,\n [COMPANY BASICS]\n ...
5,2022-04-11,348.0,-0.022142,D3,2,-1,-2,-1,-1,-2,-1,\n [COMPANY BASICS]\n ...
6,2022-04-12,344.100006,-0.011207,D2,2,0,-1,0,-1,-1,-1,\n [COMPANY BASICS]\n ...
7,2022-04-13,350.429993,0.018396,U2,1,0,1,1,0,0,-1,\n [COMPANY BASICS]\n ...
8,2022-04-14,341.130005,-0.026539,D3,2,0,-2,-1,-1,-2,-1,\n [COMPANY BASICS]\n ...
9,2022-04-18,337.859985,-0.009586,D1,2,-1,-2,-1,-1,-2,-1,\n [COMPANY BASICS]\n ...
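
The `Bin Label` values in the sample rows are consistent with a simple rule: `U` or `D` for the sign of the daily return, followed by the absolute return in percent, rounded up. The sketch below reproduces that rule as inferred from the ten rows shown. It is not taken from the `primogpt` source, and the cap at 5 and the handling of a zero return are assumptions.

```python
import math


def bin_label(daily_return, cap=5):
    """Label a daily return as U<n> or D<n>, where n is the percent move rounded up.

    Inferred from the sample rows only; `cap` and the zero-return case are assumptions.
    """
    direction = "U" if daily_return >= 0 else "D"
    size = math.ceil(abs(daily_return) * 100)
    size = max(1, min(cap, size))  # keep the bin between 1 and `cap`
    return f"{direction}{size}"


# Returns copied from the sample rows above, with the labels shown next to them.
for value, shown in [(0.048277, "U5"), (-0.03104, "D4"), (-0.009586, "D1")]:
    print(value, bin_label(value), shown)
```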
