In [5]:
## Role name fetched using "aws iam list-roles --query 'Roles[?contains(RoleName, `SageMaker`)]'" in terminal

## Defining sagemaker roles

In [1]:
import boto3
# Create an IAM client
iam = boto3.client('iam')
# List roles
response = iam.list_roles()
# If you want to filter for SageMaker roles, you can do it in Python
sagemaker_roles = [role for role in response['Roles'] if 'SageMaker' in role['RoleName']]
# `sagemaker_roles` now contains a list of SageMaker roles

In [2]:
sagemaker_roles[0]['RoleName']

'AmazonSageMaker-ExecutionRole-20231030T210397'

In [3]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it not exists
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='AmazonSageMaker-ExecutionRole-20231030T210397')['Role']['Arn']
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/ravi.tej/Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/ravi.tej/Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/ravi.tej/Library/Application Support/sagemaker/config.yaml


Couldn't call 'get_role' to get Role ARN from role name ravi_tej to get Role path.


sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/ravi.tej/Library/Application Support/sagemaker/config.yaml
sagemaker role arn: arn:aws:iam::005418323977:role/service-role/AmazonSageMaker-ExecutionRole-20231030T210397
sagemaker bucket: sagemaker-ap-south-1-005418323977
sagemaker session region: ap-south-1


In [4]:
import pandas as pd
import numpy as np
import json

## Data Prep

### Prompt

In [65]:
new_system_prompt =  '''
You are the chief editor for a leading Indian financial and business news website. You evaluate critical attributes of articles to gate keep content quality. For many attributes, you will first provide a brief analysis of 15 to 30 words, followed by assessment.

1. analysis_is_financial_or_business_news (short text) : <analyse if article pertains to finance/business or not. government policies directly impacting indian corporations or investors are ok, but not if aren't>
2. is_financial_or_business_news (True or False) : <True or False based on previous attribute>
3. analysis_of_relevant_for_india (short text) : <analyse if article is relevant for indians. for example articles about 401k or small foreign companies won't be relevant for india. however changes to fed interest rates or nasdaq or large multinational important news will be relevant>
4. relevant_for_india (True or False) : <True or False based on previous attribute>
5. analysis_of_article_validity_duration (short text) : <Analyse relevance duration: Stock fluctuations, 1 day; significant policy changes, weeks; educational content is timeless unless it refers to any tax or other regulations in which case only 30 days. International news in India has shorter lifespan. popular topics are usually not timeless; quarterly analysis is valid for a week, yearly for a couple of weeks and a much longer one for a month>
6. article_validity_duration (one of 1, 3, 7, 14, 30, -1) : <calculate number of days based on previous attribute. -1: timeless. 1: article is relevant only for that day. 3: for a couple of days. 7: for a week. 14: for a couple of weeks. 30: for a month>
7. analysis_of_popularity (short text) : <analyse likely popularity of article - if its for niche audience, moderate_popularity or should be part of breaking_news section, depending on number of people who will be impacted by the news and the scale of the event. foreign entities known in india but not very popular will be mostly niche or rarely moderately popular. articles targeted to very specific business or pratices will be niche. infotainment business and financial articles with some drama are likely to be more popular. articles with a list of rules without compelling story-telling will be for niche audience>
8. popularity (one of niche, moderately_popular, breaking_news) : <based on previous attribute>
9. analysis_of_article_type (short text) : <analyse if the article is majorly factual, is an opinion piece, analysis, educational or likely sponsored. factual articles relay events. opinion pieces have predictions either from the author or from statements without data. analysis pieces have substantial data to justify. if an article is overly zealous on certain stock and seems like an ad, then it is sponsored>
10. article_type (one of fact, opinion, analysis, educational, sponsored) : <based on previous attribute>
11. analysis_of_article_sentiment (short_text): <analyse if the sentiment of the article is bullish, bearish or NA. balanced is NA>
12. article_sentiment (one of bull, bear, NA): <based on previous attribute>
13. headline_suggestion (short text) : <Write a headline based on the content of the article>
14. summary (text of 60 words) : <Generate concise, entity-dense summary. The summary should become highly dense but easily understood without the Article. Don't keep the summary too short, but limit it to no more than 60 words>

your response should be a json structure with all the 14 above keys without missing any key. It is very important that the response is directly readable with json.loads(). no preamble or postamble. respond in the exact following structure:

{
"analysis_is_financial_or_business_news": "",
"is_financial_or_business_news": "",
"analysis_of_relevant_for_india": "",
"relevant_for_india": "",
"analysis_of_article_validity_duration": "",
"article_validity_duration": "",
"analysis_of_popularity": "",
"popularity": "",
"analysis_of_article_type": "",
"article_type": "",
"analysis_of_article_sentiment": "",
"article_sentiment": "",
"headline_suggestion": "",
"summary": ""
}
'''

### Loading the Dataset

In [6]:
import re
from transformers import AutoTokenizer
from random import randint
import sys
sys.path.append("../scripts/utils")
from pack_dataset import pack_dataset
from datasets import Dataset

### Defining the model

In [7]:
model_id = "teknium/OpenHermes-2.5-Mistral-7B"
# model_id = 'ehartford/dolphin-2.0-mistral-7b'
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [59]:
import requests

### Formatting the GPT responses for prompt output

In [8]:
with open('train_set_full_27_nov.json', 'r') as f:
    train_set = json.load(f)

In [9]:
# structuring the format
for art_id in train_set:
    res = train_set[art_id]['response']
    train_set[art_id] = {'content': train_set[art_id]['content'], 'response': {}}
    train_set[art_id]['response']['attributes'] = res[0]
    train_set[art_id]['response']['summaries'] = res[1]

In [20]:
def correct_validity_duration(val):
    val = int(val)
    valid_days = [-1, 1, 3, 7, 14, 30]
    if val in valid_days:
        return val
    else:
        for i in valid_days:
            if val > i:
                valid_value = i
        return valid_value

In [21]:
correct_validity_duration('365')

30

In [22]:
for art_id in train_set:
    train_set[art_id]['unified_response'] = train_set[art_id]['response']['attributes']
    train_set[art_id]['unified_response']['summary'] = train_set[art_id]['response']['summaries'][0]['denser_summary']
    train_set[art_id]['unified_response']['is_financial_or_business_news'] = True if int(train_set[art_id]['response']['attributes']['is_financial_or_business_news']) == 1 else False if int(train_set[art_id]['response']['attributes']['is_financial_or_business_news']) == 0 else None
    train_set[art_id]['unified_response']['relevant_for_india'] = True if int(train_set[art_id]['response']['attributes']['relevant_for_india']) == 1 else False if int(train_set[art_id]['response']['attributes']['relevant_for_india']) == 0 else None
    train_set[art_id]['unified_response']['article_validity_duration'] = correct_validity_duration(train_set[art_id]['response']['attributes']['article_validity_duration'])

In [216]:
train_set['6555c7e14b13023f9348e982']['content']

'Convenience Fee: Definition Examples and How to Avoid Them: What Is a Convenience Fee A convenience fee is a fee charged by a seller when a consumer pays with an electronic payment card rather than by a standard form of payment accepted by the business. Standard payments include cash check or an Automated Clearing House (ACH) transfer. Convenience fees can be a fixed dollar amount or a percentage of the transaction amount usually 2% to 3% and must be disclosed to the consumer in advance. Types of payments where the payee typically charges a convenience fee include mortgage payments property tax payments college tuition and taxes. Understanding a Convenience Fee Convenience fees can help a business cover some of the costs imposed through electronic payment processing. Businesses have to pay a merchant fee every time one of their customers uses a credit card. For most businesses such as department stores and grocery stores a merchant fee is just a cost of doing business. On the other ha

In [212]:
train_set.keys()

dict_keys(['651de33fa662d76276b803d9', '6555c2124b13023f9348d13c', '65367ff01e5cc42b1b143d0d', '651e1ddca662d76276b892c7', '652ebbb81e5cc42b1b1399e8', '6555b49f4b13023f9348b6d6', '65316f831e5cc42b1b1404ea', '65309c0a1e5cc42b1b13a6b6', '651e17dfa662d76276b884de', '651e2032a662d76276b8985d', '6555c1124b13023f9348ce01', '6556d9654b13023f934aee38', '6540c74a2936d70acf71e4e7', '651de21ea662d76276b800b4', '6536800e1e5cc42b1b143f17', '655b3e4c4b13023f934af3c3', '651de220a662d76276b800b9', '652ebbdc1e5cc42b1b139b7a', '65316fb61e5cc42b1b14140a', '6555c8154b13023f9349066e', '6555c99c4b13023f93492ae9', '6555cb5b4b13023f9349cbff', '6555c7df4b13023f9348e8ab', '6555c4794b13023f9348d8af', '6555c80e4b13023f934901f0', '652ebba21e5cc42b1b139964', '6555c37f4b13023f9348d497', '654856d2dc4fa72a6c403a46', '6555bfd74b13023f9348c8a6', '6555c0614b13023f9348cb14', '6555c7c64b13023f9348e0e6', '6555ca964b13023f93494a79', '653168c31e5cc42b1b13b8a4', '6555c8214b13023f93490e78', '6555c4754b13023f9348d89a', '651e19e6

In [27]:
a = [train_set[k]['unified_response']['article_sentiment'] for k in train_set.keys()]

In [28]:
set(a)

{'NA', 'bear', 'bull'}

In [66]:
messages = [
    {"role": "system", "content": "You are Hermes 2."},
    {"role": "user", "content": "Hello, who are you?"}
]

In [115]:
tokenizer.encode('<|im_end|>')

[1, 32000]

In [93]:
tokenizer.decode(tokenizer.apply_chat_template(messages, add_generation_prompt=False))

'<|im_start|> system\nYou are Hermes 2.<|im_end|> \n<|im_start|> user\nHello, who are you?<|im_end|> \n'

In [29]:
def format_prompt(train_row):
    instruction = f"<|im_start|>system\n{new_system_prompt}<|im_end|>\n"
    # not adding context as instruction ends with |actual_article|
    context = f"### <|im_start|>user\n{train_row['content']}|im_end|>\n"
    response = f"### <|im_start|>assistant\n{train_row['response']}"
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    prompt = re.sub(r'\n+','\n',prompt)
    return prompt

### Article Truncation logic

In [101]:
def calculate_tokens(text, encoder):  # Placeholder for your actual token calculation function
    # Your implementation will go here.
    return len(encoder.encode(text))  # Example: counting characters as tokens

def truncate_text_to_token_limit(text,encoder, token_limit):
    # First, check if the whole text is under the token limit
    if calculate_tokens(text, encoder) <= token_limit:
        return text  # The entire text is within the limit

    def is_under_limit(index):
        # Use the provided function to calculate tokens for the substring
        return calculate_tokens(text[:index], encoder) <= token_limit

    left, right = 0, len(text)
    valid_limit = 0  # This will hold the index of the last valid token position

    # Binary search to find the token limit
    while left <= right:
        mid = (left + right) // 2  # Find the midpoint
        if is_under_limit(mid):
            # If the midpoint is under the limit, store it as a valid limit
            valid_limit = mid
            left = mid + 1  # Move the left boundary to the right
        else:
            right = mid - 1  # Move the right boundary to the left

    # Find the last space before the valid_limit to ensure we're at a word boundary
    space_index = text.rfind(' ', 0, valid_limit)
    if space_index == -1:
        # If there's no space, we've hit the start of the text
        return text[:valid_limit]  # Return up to the valid limit even if mid-word

    # Return the text up to the last word within the token limit
    return text[:space_index]

### Article Size distribution

In [86]:
article_sizes = {k: len(tokenizer.encode(train_set[k]['content'])) for k in train_set}

In [105]:
article_sizes

{'651de33fa662d76276b803d9': 110,
 '6555c2124b13023f9348d13c': 710,
 '65367ff01e5cc42b1b143d0d': 765,
 '651e1ddca662d76276b892c7': 688,
 '652ebbb81e5cc42b1b1399e8': 874,
 '6555b49f4b13023f9348b6d6': 1053,
 '65316f831e5cc42b1b1404ea': 2289,
 '65309c0a1e5cc42b1b13a6b6': 799,
 '651e17dfa662d76276b884de': 376,
 '651e2032a662d76276b8985d': 1143,
 '6555c1124b13023f9348ce01': 191,
 '6556d9654b13023f934aee38': 717,
 '6540c74a2936d70acf71e4e7': 2195,
 '651de21ea662d76276b800b4': 320,
 '6536800e1e5cc42b1b143f17': 1380,
 '655b3e4c4b13023f934af3c3': 642,
 '651de220a662d76276b800b9': 361,
 '652ebbdc1e5cc42b1b139b7a': 3510,
 '65316fb61e5cc42b1b14140a': 511,
 '6555c8154b13023f9349066e': 1538,
 '6555c99c4b13023f93492ae9': 143,
 '6555cb5b4b13023f9349cbff': 1726,
 '6555c7df4b13023f9348e8ab': 204,
 '6555c4794b13023f9348d8af': 503,
 '6555c80e4b13023f934901f0': 1898,
 '652ebba21e5cc42b1b139964': 438,
 '6555c37f4b13023f9348d497': 395,
 '654856d2dc4fa72a6c403a46': 214,
 '6555bfd74b13023f9348c8a6': 862,
 '655

In [104]:
len(train_set['6555b5a44b13023f9348bdef']['content'].split(' '))

3531

In [92]:
OUTPUT_TOKEN_LIMIT = 700
INSTRUCTION_TOKENS = len(tokenizer.encode(new_system_prompt))

BUFFER_TOKENS = 21

ARTICLE_TOKEN_LIMIT = 4096 - OUTPUT_TOKEN_LIMIT - INSTRUCTION_TOKENS - BUFFER_TOKENS

ARTICLE_TOKEN_LIMIT

In [117]:
# making dataset it ready for training
modified_train_set = {}
for art_id in train_set:
    modified_train_set[art_id] = {}
    # train_set[key]['text'] = format_prompt(train_set[key])
    # modified_train_set[art_id]['text'] = train_set[key]['text']
    # train_set[key]['article_id'] = key
    modified_train_set[art_id]['article_id'] = art_id
    modified_train_set[art_id]['content'] = train_set[art_id]['content']
    modified_train_set[art_id]['response'] = json.dumps(train_set[art_id]['unified_response'])
    # train_set[key]['unified_response'] = json.dumps(train_set[key]['unified_response'])

In [19]:
# # making dataset it ready for training
# for key in train_set:
#     train_set[key]['text'] = format_prompt(train_set[key])
#     train_set[key]['article_id'] = key
#     train_set[key]['response'] = json.dumps(train_set[key]['response'])

In [118]:
train_df = pd.DataFrame.from_dict(modified_train_set).T

In [204]:
dataset = Dataset.from_pandas(train_df)

### Setting up prompt in ChatML Format

In [120]:
def format_text_response_as_prompt(train_row):
    truncated_content = truncate_text_to_token_limit(text=train_row['content'], encoder=tokenizer, token_limit=ARTICLE_TOKEN_LIMIT)
    messages = [{"role": "system", "content": new_system_prompt},
                {"role": "user", "content": f"|article_start|\n {truncated_content}\n|article_end|\n"}]
    context_prompt = tokenizer.decode(tokenizer.apply_chat_template(messages, add_generation_prompt=False))
    prompt = context_prompt + train_row['response']
    prompt = re.sub(r'\n+','\n',prompt)
    return prompt

In [100]:
len(tokenizer.encode(format_text_response_as_prompt(train_df[train_df.article_id == '6555b54e4b13023f9348ba8b'].iloc[0])))

3753

### Create chunks

In [179]:
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

1

In [205]:
# template dataset to add prompt to each sample
def template_dataset(sample):
    # sample["text"] = f"{format_prompt(sample)}{tokenizer.eos_token}"
    sample["text"] = f"{format_text_response_as_prompt(sample)}{tokenizer.eos_token}"
    return sample

dataset = dataset.map(template_dataset)
# tokenize dataset
dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
)
# new_column = dataset['input_ids']
# dataset = dataset.add_column("labels", new_column)
# chunk dataset
lm_dataset = pack_dataset(dataset, chunk_length=4096) # We use 4096 as the maximum length for packing
print(f"Total number of samples: {len(lm_dataset)}")


Map:   0%|          | 0/721 [00:00<?, ? examples/s]

Map:   0%|          | 0/721 [00:00<?, ? examples/s]

Chunking dataset into chunks of 4096 tokens.


Map:   0%|          | 0/721 [00:00<?, ? examples/s]

Total number of samples: 407
Total number of samples: 407


In [191]:
tokenizer(dataset[0]['text'],max_length=4096, truncation=True,padding='max_length')

{'input_ids': [32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002, 32002

In [206]:
lm_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 407
})

In [37]:
sess

<sagemaker.session.Session at 0x14eaf1890>

In [155]:
sess.default_bucket()

'sagemaker-ap-south-1-005418323977'

In [26]:
# !pip install s3fs

### Saving training data to s3

In [207]:
# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/fine_tuning_datasets/gpt4-samples/train-full-29-nov-packed'
lm_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")

Saving the dataset (0/1 shards):   0%|          | 0/407 [00:00<?, ? examples/s]

uploaded data to:
training dataset to: s3://sagemaker-ap-south-1-005418323977/fine_tuning_datasets/gpt4-samples/train-full-29-nov-packed


In [208]:
lm_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 407
})

### Hyperparamters

In [209]:
from huggingface_hub import HfFolder


# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                             # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',    # path where sagemaker will save training dataset
  'num_train_epochs': 2,                            # number of training epochs
  'per_device_train_batch_size': 5,                 # batch size for training
  'gradient_accumulation_steps': 3,                 # Number of updates steps to accumulate
  'gradient_checkpointing': True,                   # save memory but slower backward pass
  'bf16': True,                                     # use bfloat16 precision
  'tf32': True,                                     # use tf32 precision
  'learning_rate': 5e-4,                            # learning rate
  'max_grad_norm': 0.3,                             # Maximum norm (for gradient clipping)
  'warmup_ratio': 0.03,                             # warmup ratio
  "lr_scheduler_type":"cosine_with_restarts",                   # learning rate scheduler
    'weight_decay': 0.1,
  'save_strategy': "epoch",                         # save strategy for checkpoints
  "logging_steps": 10,                              # log every x steps
  'merge_adapters': True,                           # wether to merge LoRA into the model (needs more memory)
  'use_flash_attn': True,                           # Whether to use Flash Attention
  'output_dir': '/tmp/run',                         # output directory, where to save assets during training
                                                    # could be used for checkpointing. The final trained
                                                    # model will always be saved to s3 at the end of training
}

if HfFolder.get_token() is not None:
    hyperparameters['hf_token'] = HfFolder.get_token() # huggingface token to access gated models, e.g. llama 2

In [210]:
from sagemaker.huggingface import HuggingFace

# define Training Job Name
job_name = f'huggingface-qlora-{hyperparameters["model_id"].replace("/","-").replace(".","-")}-full-29-Nov-unpacked'

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_qlora.py',    # train script
    source_dir           = './utils/',      # directory which includes all the files needed for training
    instance_type        = 'ml.g5.12xlarge',   # instances type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    max_run              = 6*60*60,        # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # Iam role used in training job to access AWS ressources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the pytorch_version version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      =  hyperparameters,  # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
    disable_output_compression = True         # not compress output to save training time and cost
)

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/ravi.tej/Library/Application Support/sagemaker/config.yaml


### Training job

In [211]:
# define a data input dictonary with our uploaded s3 uris
data = {'training': training_input_path}
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-qlora-teknium-OpenHermes-2--2023-11-29-17-17-59-738


Using provided s3_resource
2023-11-29 17:18:01 Starting - Starting the training job...
2023-11-29 17:18:28 Starting - Preparing the instances for training......
2023-11-29 17:19:31 Downloading - Downloading input data...
2023-11-29 17:19:51 Training - Downloading the training image..............................
2023-11-29 17:24:43 Training - Training image download completed. Training in progress.....[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[34mbash: no job control in this shell[0m
[34m2023-11-29 17:25:37,512 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training[0m
[34m2023-11-29 17:25:37,566 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2023-11-29 17:25:37,574 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.[0m
[34m2023-11-29 17:25:37,576 sagemaker_pytorch_container.training INFO     Invoking user training 

### Old Training Job

In [48]:
# define a data input dictonary with our uploaded s3 uris
data = {'training': training_input_path}
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-qlora-teknium-OpenHermes-2--2023-11-29-09-07-51-205


Using provided s3_resource
2023-11-29 09:07:52 Starting - Starting the training job......
2023-11-29 09:08:31 Starting - Preparing the instances for training...
2023-11-29 09:09:25 Downloading - Downloading input data...
2023-11-29 09:09:40 Training - Downloading the training image........................
2023-11-29 09:13:52 Training - Training image download completed. Training in progress.......[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[34mbash: no job control in this shell[0m
[34m2023-11-29 09:14:48,195 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training[0m
[34m2023-11-29 09:14:48,249 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)[0m
[34m2023-11-29 09:14:48,258 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.[0m
[34m2023-11-29 09:14:48,260 sagemaker_pytorch_container.training INFO     Invoking user training scri

### Deployment

In [218]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="1.1.0",
  session=sess,
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

INFO:sagemaker.image_uris:Defaulting to only available Python version: py39
INFO:sagemaker.image_uris:Defaulting to only supported image scope: gpu.


llm image uri: 763104351884.dkr.ecr.ap-south-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04


In [219]:
llm_image

'763104351884.dkr.ecr.ap-south-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04'

In [220]:
role

'arn:aws:iam::005418323977:role/service-role/AmazonSageMaker-ExecutionRole-20231030T210397'

In [221]:
len(tokenizer.encode(a))

365

In [172]:
config

Available objects for config:
    AliasManager
    DisplayFormatter
    HistoryManager
    IPCompleter
    IPKernelApp
    LoggingMagics
    MagicsManager
    OSMagics
    PrefilterManager
    ScriptMagics
    StoreMagics
    ZMQInteractiveShell


In [224]:
huggingface_estimator.model_data["S3DataSource"]["S3Uri"]

's3://sagemaker-ap-south-1-005418323977/huggingface-qlora-teknium-OpenHermes-2--2023-11-29-17-17-59-738/output/model/'

#### Download model to local

In [180]:
import boto3
import os

# Initialize a boto3 S3 client
s3 = boto3.client('s3')

In [181]:
import boto3
import os

# Initialize a boto3 S3 client
s3 = boto3.client('s3')

# S3 bucket and folder details
bucket_name = 'sagemaker-ap-south-1-005418323977'
s3_folder = 'huggingface-qlora-teknium-OpenHermes-2--2023-11-29-05-23-47-562'

# Local directory to save files
local_folder = './hermes_full_finetuned_model/'

# List objects within the specified S3 folder
objects = s3.list_objects_v2(Bucket=bucket_name, Prefix=s3_folder)

# Download each file in the folder
for obj in objects.get('Contents', []):
    s3_file_path = obj['Key']
    local_file_path = os.path.join(local_folder, s3_file_path[len(s3_folder):])
    os.makedirs(os.path.dirname(local_file_path), exist_ok=True)
    s3.download_file(bucket_name, s3_file_path, local_file_path)
    print(f'Downloaded {s3_file_path} to {local_file_path}')

# Remember to replace 'your-bucket-name', 'your-folder-name/', and 'path/to/local/folder/' with your actual bucket name, S3 folder, and local folder path.


OSError: [Errno 30] Read-only file system: '/debug-output'

In [64]:
import transformers

In [65]:
from transformers import AutoModel, AutoConfig
from huggingface_hub import HfFolder

In [66]:
import transformers

In [67]:
transformers.__version__

'4.35.2'

In [12]:
!pip install --upgrade git+https://github.com/huggingface/transformers

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /private/var/folders/d4/cgyr_gnj7nn2wy_hq40gkq8c0000gq/T/pip-req-build-fvqwwsub
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /private/var/folders/d4/cgyr_gnj7nn2wy_hq40gkq8c0000gq/T/pip-req-build-fvqwwsub
  Resolved https://github.com/huggingface/transformers to commit 3bc50d81e6c70d63e59d635106bac6a561b47681
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... [?25ldone
[?25h  Created wheel for transformers: filename=transformers-4.36.0.dev0-py3-none-any.whl size=8048947 sha256=38109c4d4a4b05baf5d143f483a397e952fdded8c2a4b4dc9c36e75e3483b40d
  Stored in directory: /private/var/folders/d4/cgyr_gnj7nn2

In [4]:
!pip install -U transformers


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [68]:
import torch

### Upload to HF

In [69]:
model = AutoModel.from_pretrained('./hermes_finetuned_model/', local_files_only = True, torch_dtype = torch.float16)

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

In [70]:
model.push_to_hub('WintWealth/partial_finetuned_open_hermes_2.5', token = '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', private=True)

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.28G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/WintWealth/partial_finetuned_open_hermes_2.5/commit/e48ca5f4ad711b31dfa6ea52d45a0e548e5a9adb', commit_message='Upload model', commit_description='', oid='e48ca5f4ad711b31dfa6ea52d45a0e548e5a9adb', pr_url=None, pr_revision=None, pr_num=None)

### Deploy

In [225]:
import json
from sagemaker.huggingface import HuggingFaceModel

# s3 path where the model will be uploaded
# if you try to deploy the model to a different time add the s3 path here
# model_s3_path = huggingface_estimator.model_data["S3DataSource"]["S3Uri"]
model_s3_path = huggingface_estimator.model_data["S3DataSource"]["S3Uri"]

# sagemaker config
instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(3072), # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(4096), # Max length of the generation (including input text)
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  model_data={'S3DataSource':{'S3Uri': model_s3_path,'S3DataType': 'S3Prefix','CompressionType': 'None'}},
  env=config
)

sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/ravi.tej/Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/ravi.tej/Library/Application Support/sagemaker/config.yaml


In [226]:

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model
)

INFO:sagemaker:Creating model with name: huggingface-pytorch-tgi-inference-2023-12-01-00-40-04-949
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-tgi-inference-2023-12-01-00-40-05-774
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-tgi-inference-2023-12-01-00-40-05-774


--------!

In [227]:
endpoint_name = llm.endpoint_name

In [228]:
endpoint_name

'huggingface-pytorch-tgi-inference-2023-12-01-00-40-05-774'

In [53]:
endpoint_name = llm.endpoint_name

In [54]:
endpoint_name

'huggingface-pytorch-tgi-inference-2023-11-29-10-43-04-737'

In [55]:
# endpoint_name = 'huggingface-pytorch-tgi-inference-2023-11-28-06-16-30-408'

In [83]:
# system_prompt = df.iloc[0].system_prompt

In [57]:
def format_article_for_prompt(article_text):
    instruction = f"<|im_start|>system\n{new_system_prompt}<|im_end|>\n"
    # not adding context as instruction ends with |actual_article|
    context = f"### <|im_start|>user\n{article_text}|im_end|>\n"
    response = f"### <|im_start|>assistant\n"
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    prompt = re.sub(r'\n+','\n',prompt)
    return prompt

In [58]:
parent_folder = '/Users/ravi.tej/Desktop/ML/Recommendations/arcane/'
from hydra import compose, initialize
import os

import xml.etree.ElementTree as ET

tree = ET.parse('../../conf/application.run.xml')
root = tree.getroot()

envs_element = root.find('./configuration/envs')
for variable in envs_element.findall('env'):
    name = variable.get('name')
    value = variable.get('value')
    os.environ[name] = value

import sys
sys.path.append('/Users/ravi.tej/Desktop/ML/Recommendations/arcane/')

from src._utils import load_bertopic_model_from_hf

  @numba.jit()
  @numba.jit()
  @numba.jit()
  @numba.jit()


In [2]:
from src.articles.ArticleService import ArticleService

In [3]:
art = ArticleService._get_article_json_from_s3_and_api('652a045b50af0e25a9122fd2')

In [6]:
art['title'] + art['cleaned_text']

"Parasitic, blood-sucking 'alien-like' wasp found in Amazon, eats host from inside outNEW DELHI: Scientists have identified a horrifying and dangerous new species of parasitic wasp that feeds on the blood of its host and devours it from the inside out. The Daily Star reports, this alien-like insect, named Capitojoppa amazonica, was discovered in the Amazon, specifically within the Allpahuayo-Mishana National Reserve in Peru. The parasitic wasp reaches a size of approximately 1.7cm and possesses a tube-like organ that injects an egg into the host's body. It usually targets caterpillars, beetles, and spiders.The wasp was discovered as part of an extensive, ongoing research project. The research team used specialized tent-like traps to capture flying insects in the rainforest. Capitojoppa amazonica is just one out of 109 newly discovered species.Brandon Claridge from Utah State University, who is also the lead author of the study that describes Capitojoppa amazonica, explained to Live Sci

In [12]:
smr = sess.boto_session.client("sagemaker-runtime")

In [13]:
parameters = {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.03,
    "stop": ["###", "</s>"],
}

In [14]:
content = '''
World economic issues cast shadow on Indian stock market, FIIs continue to sell
The Indian stock market is facing challenges amid a contracting global landscape. Despite high valuations, factors like high inflation, bond yields, and geopolitical tensions are affecting market sustainability.

 Indian stock market under pressure due to FII selling (File photo)
Indian stock market under pressure due to FII selling (File photo)
It’s been a long journey for the Indian stock market. It has been growing well, generating superior wealth in the ongoing century. The Nifty 500, the broader equity index, has provided a decent CAGR of 13.6% over about 23 years. A one-time investment of ₹1 lakh in December 2000 would have been ₹21 lakh today. From the latest intra high of 17,754.05, dated September 12, it went down by 7.25% on October 26, and on November 3, Nifty 50 closed at 17,000.95, which is 4.25% lower. A short-term break in the long-term growing market.

At the onset of the 21st century, the big picture for India was as a rising emerging market as the domestic economy had opened for world business. Initially, the focus was on infrastructure development, particularly in the areas of roads, power and realty, seen as basis fundamental to the new economy. However, it was often characterized as an elephant economy, substantial and steady, driven mainly by domestic demand, and not swiftly adapting to global opportunities. This placid perception has since shifted in the present decade, with the economy now recognized as a burgeoning force poised to become a global supply hub in the future. The multiplier effect is seen in varied sectors like Digital, Renewables, Electronics, Technology, Pharma to Chemical, while efficient working of government expenditure is also uplifting rural and domestic demand.

The Indian economy & fiscal situation are as strong as they have ever been. Projections indicate a stable 6.5% YoY GDP growth from FY24 to FY26, alongside a 5.25% fiscal deficit, even amid global economic deceleration. H1FY24 corporate earnings growth has been bumper, with PAT growth of top 100 large cap estimates at 35% YoY. While no intrinsic structural issues have been identified within India, global circumstances have instigated fluctuations in the stock market, leading to currency volatility. INR has depreciated against USD, 83.270 Friday closing from 82.140 at March-end.

The recent decline in the Indian stock market is predominantly driven by global factors. Notably, there is a conspicuous deceleration in the global economy, as evidenced by Europe's recession, with Germany, the region's foremost manufacturing hub, recording negative GDP growth for the past three quarters. In Asia, the engine growth of China is decelerating. Annual taker of 7% GDP growth, during pre-covid is forecast to settle to 4.5% in the future. It is leading the government to consider implementing a significant stimulus package to regain traction.

Amid a contracting global landscape, two nations, the US and India, are decoupling. In 2022, the US was projected to enter a recession in the latter part of 2023, however, it managed to avert this scenario through the implementation of a comprehensive $8 trillion COVID assistance package, along with fiscal and monetary stimulus measures introduced by the government between 2020 and 2023. These initiatives had far-reaching benefits, extending support to households, states, healthcare, businesses, and other institutions. Consequently, the likelihood of a recession has now significantly diminished. However, a slowdown is forecast, the annual GDP growth is estimated to reduce from 2.3% in CY23 to 1% in CY24 due to high fiscal deficit, interest rate and quantitative tightening by the US FED.

This is the primary issue of the global stock market, and the fallout of the world economy. The current global economic landscape stands in contrast to the elevated trajectory of the Indian stock market. It is becoming a challenge to hold the gains due to high FIIs selling in the last 3-4months. Even the optimistic H1 results are not supporting the market to sustain the momentum strong.

Despite long-standing imbalances in the economy, including high inflation, elevated bond yields, geopolitical tensions, and supply constraints, the Indian stock market, like the main indices, has maintained a high valuation. For example, the MSCI India Index has been trading at an average one-year forward P/E of 20.5x above the long-term of 18x. Today at 19.6x, a dichotomy in context to dollar terms with elevated bond yield trading at decades high of 5%. FIIs are cautious as interest rates are expected to stay high, in-stroke to the hawkish central bank view, and economic slowdown and moderation in future earnings are warranting a consolidation in prices and valuations.

The author, Vinod Nair is Head of Research at Geojit Financial Services

Disclaimer: The views and recommendations made above are those of individual analysts or broking companies, and not of Mint. We advise investors to check with certified experts before taking any investment decisions.
'''

In [131]:
# request = {"inputs": format_article_for_prompt(art.full_content), "parameters": parameters, "stream": False}

In [20]:
import re

In [21]:
request = {"inputs": format_article_for_prompt(content), "parameters": parameters, "stream": False}

In [24]:
resp = smr.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(request),
    ContentType="application/json",
)

In [33]:
k = resp['Body'].read()

json.loads(json.loads(k)[0]['generated_text'])['summaries']

In [None]:
15000

In [146]:
txt = '''
The article highlights the challenges faced by the Indian stock market in light of global economic issues. Despite India's strong economy and positive corporate earnings growth, foreign institutional investors (FIIs) are selling their shares due to the global economic deceleration. The contraction in the global economy, led by Europe's recession and China's decelerating economy, is affecting the Indian market. The recent fall in the stock market is attributed to global factors, not intrinsic structural issues within India. The article concludes by stating that despite the challenges, the Indian stock market has maintained a high valuation, which is leading to consolidation in prices and valuations
'''

In [147]:
len(txt.split(' '))

104

In [121]:
art.full_content

'Sanghi Industries share price Today Live Updates : Sanghi Industries sees upward trend in trading  Mint:  On the last day Sanghi Industries stock opened at ₹ 123 and closed at ₹ 122.65. The stock reached its highest point of ₹ 123 and lowest point of ₹ 120.45 during the day. The market capitalization of the company is ₹ 3129.62 crore. The 52-week high and low for the stock are ₹ 131.9 and ₹ 51.55 respectively. The BSE volume for the stock was 6082 shares. Disclaimer: This is an AI-generated live blog and has not been edited by LiveMint staff. The current days low price of Sanghi Industries stock is ₹ 120.45 and the high price is ₹ 123. The current data for Sanghi Industries stock shows that the stock price is ₹ 122.7. There has been a 0.04 percent change in the stock price with a net change of 0.05. The current data for Sanghi Industries stock shows that the price is ₹ 121.3 with a percent change of -1.1 and a net change of -1.35. This indicates that the stock has decreased in value b

### Delete endpoint