# Inferentia

This notebook tests the AWS Inferentia chip, prompted by claims that it is much faster than even an A100 ([source](https://huggingface.co/blog/accelerate-transformers-with-inferentia2)).

## Resources

* https://awsdocs-neuron.readthedocs-hosted.com/en/latest/
* https://github.com/aws-neuron/aws-neuron-sdk
* https://huggingface.co/docs/optimum-neuron/installation
* https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/models/inference-inf2-trn1-samples.html#model-samples-inference-inf2-trn1

## Mistral 

Mistral is a decoder model ([source](https://huggingface.co/docs/transformers/main/en/model_doc/mistral)), and inference is supported on Neuron ([source](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/model-architecture-fit.html#model-architecture-fit)).

* https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide.html#mistral-gqa-code-sample

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/amazon-linux/torch-neuronx-al2.html#setup-torch-neuronx-al2
```
# Configure Linux for Neuron repository updates
sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <<EOF
[neuron]
name=Neuron YUM Repository
baseurl=https://yum.repos.neuron.amazonaws.com
enabled=1
metadata_expire=0
EOF
sudo rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

# Update OS packages 
sudo yum update -y

# Install OS headers 
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y

# Install git 
sudo yum install git -y

# install Neuron Driver
sudo yum install aws-neuronx-dkms-2.* -y

# Install Neuron Runtime 
sudo yum install aws-neuronx-collectives-2.* -y
sudo yum install aws-neuronx-runtime-lib-2.* -y

# Install Neuron Tools 
sudo yum install aws-neuronx-tools-2.* -y

# Add PATH
export PATH=/opt/aws/neuron/bin:$PATH
```
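
After the install, a quick check from Python can confirm that the Neuron tools are reachable. This is a minimal sketch: `neuron_tool_path` is a hypothetical helper, and it assumes that `neuron-ls` (shipped with `aws-neuronx-tools`) lives in `/opt/aws/neuron/bin`, as the PATH export above suggests.

```python
import os
import shutil

NEURON_BIN = "/opt/aws/neuron/bin"  # directory added to PATH in the setup above

def neuron_tool_path(tool="neuron-ls", extra_dir=NEURON_BIN):
    """Return the full path to a Neuron CLI tool, or None if it is not found."""
    # Search the current PATH plus the Neuron bin directory
    search_path = os.pathsep.join([os.environ.get("PATH", ""), extra_dir])
    return shutil.which(tool, path=search_path)

path = neuron_tool_path()
print(path or "neuron-ls not found; check the driver and tools install")
```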

Install git lfs
https://stackoverflow.com/questions/71448559/git-large-file-storage-how-to-install-git-lfs-on-aws-ec2-linux-2-no-package
```
sudo amazon-linux-extras install epel -y 
sudo yum-config-manager --enable epel
sudo yum install git-lfs -y
```


https://huggingface.co/docs/optimum-neuron/installation
```
export TMPDIR=/mnt/efs/data/AIEresearch/.venv_tmp
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
python -m pip install optimum[neuronx]
```

https://github.com/aws-neuron/transformers-neuronx#installation
```
pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com
```
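
Before running the cells below, it can help to confirm that the packages are importable in the active environment. A small sketch: `missing_packages` is a hypothetical helper, and the module names `torch_neuronx` and `optimum` are assumed import names for the packages installed above (`transformers_neuronx` is the one the notebook actually imports).

```python
import importlib.util

def missing_packages(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Assumed import names for the packages installed above
print(missing_packages(["torch_neuronx", "transformers_neuronx", "optimum"]))
```

An empty list means every module was found.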

In [1]:
# Imports 
import torch
from transformers_neuronx import constants
from transformers_neuronx.mistral.model import MistralForSampling
from transformers_neuronx.module import save_pretrained_split
from transformers_neuronx.config import NeuronConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

In [2]:
# Load the CPU model and save it as a split checkpoint (the bf16 cast is applied later via amp='bf16')
model_cpu = AutoModelForCausalLM.from_pretrained('mistralai/Mistral-7B-Instruct-v0.1')
save_pretrained_split(model_cpu, 'mistralai/Mistral-7B-Instruct-v0.1-split')

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

In [3]:
# Set sharding strategy for GQA to be shard over heads
neuron_config = NeuronConfig(
    grouped_query_attention=constants.GQA.SHARD_OVER_HEADS
)
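
To make the sharding setting concrete, here is a rough sketch of the head arithmetic only. It is not how `transformers-neuronx` implements the split: `heads_per_core` is a hypothetical helper, and the even-divisibility check is an assumption about what a shard over heads needs. Mistral-7B has 32 query heads and 8 key/value heads.

```python
def heads_per_core(n_heads, n_kv_heads, tp_degree):
    """Split query and key/value heads evenly across NeuronCores."""
    # Assumption: an even split needs both head counts divisible by tp_degree
    if n_heads % tp_degree or n_kv_heads % tp_degree:
        raise ValueError("head counts must be divisible by tp_degree for an even split")
    return n_heads // tp_degree, n_kv_heads // tp_degree

# Mistral-7B with the tp_degree=2 used in this notebook
print(heads_per_core(32, 8, tp_degree=2))  # (16, 4)
```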

In [4]:
# Create and compile the Neuron model
model_neuron = MistralForSampling.from_pretrained('mistralai/Mistral-7B-Instruct-v0.1-split', batch_size=1,
    tp_degree=2, n_positions=256, amp='bf16', neuron_config=neuron_config)
model_neuron.to_neuron()

2024-01-24 09:00:49.000397:  92186  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-24 09:00:49.000403:  92186  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ec2-user/neuroncc_compile_workdir/3bdf9b03-c2ac-4403-8dc0-08defb00d2b1/model.MODULE_46fb50b64488cef81ba9+2c2d707e.hlo.pb', '--output', '/tmp/ec2-user/neuroncc_compile_workdir/3bdf9b03-c2ac-4403-8dc0-08defb00d2b1/model.MODULE_46fb50b64488cef81ba9+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
2024-01-24 09:00:49.000553:  92189  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-24 09:00:49.000557:  92187  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-24 09:00:49.000558:  92189  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ec2-user/neuroncc_compile_wor

In [5]:
# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.1')

tokenizer_config.json:   0%|          | 0.00/1.47k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

In [6]:
# Set message
message = "[INST] How many parts does Medicare have? [/INST]"
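
The `[INST] ... [/INST]` wrapper can be factored into a small helper so other questions reuse the same format. A sketch only: `format_instruction` is a hypothetical name, and the template is copied from the message above.

```python
def format_instruction(question):
    """Wrap a question in the [INST] ... [/INST] tags used by the prompt above."""
    return f"[INST] {question.strip()} [/INST]"

print(format_instruction("How many parts does Medicare have?"))
```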

In [7]:
# Encode message
encoded_input = tokenizer(message, return_tensors='pt')

In [8]:
# Run inference
with torch.inference_mode():
    generated_sequences = model_neuron.sample(encoded_input.input_ids, sequence_length=256, start_ids=None)
    # print([tokenizer.decode(tok) for tok in generated_sequence])

2024-Jan-24 09:04:07.0052 88243:95922 [1] init.cc:96 CCOM WARN Linux kernel 4.14 requires setting FI_EFA_FORK_SAFE=1 environment variable.  Multi-node support will be disabled.
Please restart with FI_EFA_FORK_SAFE=1 set.
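
The warning asks for `FI_EFA_FORK_SAFE=1`. A minimal sketch of setting it from Python follows. `ensure_env` is a hypothetical helper. Whether setting the variable in-process, before the Neuron runtime loads, is enough is an assumption; exporting it in the shell before starting Jupyter is the safer route, since the message says to restart.

```python
import os

def ensure_env(name, value, env=os.environ):
    """Set env[name] to value unless it is already set; return the effective value."""
    return env.setdefault(name, value)

# Run this before the Neuron runtime is loaded (for example, before model_neuron.to_neuron())
ensure_env("FI_EFA_FORK_SAFE", "1")
```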


In [11]:
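# Decode each generated sequence and keep only the model's answer
outputs = []
for tok in generated_sequences:
    output = tokenizer.decode(tok)
    if '[/INST]' in output:
        # Drop the echoed prompt and everything from the end-of-sequence token on
        outputs.append(output.split('[/INST]')[-1].split('</s>')[0])
    else:
        outputs.append(output)
    print("".join(outputs))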

 Medicare is a health insurance program for people over the age of 65 and certain people with disabilities or certain health conditions. It is divided into two main parts:

1. Medicare Part A (Hospital Insurance): This part of Medicare covers inpatient hospital services, inpatient skilled nursing facility services, home health care services, and hospice care. It is funded by payroll taxes paid by employees and employers, as well as by premiums paid by retirees.
2. Medicare Part B (Medical Insurance): This part of Medicare covers outpatient medical services, such as doctor visits, preventive services, and prescription drugs. It is funded by premiums paid by retirees, as well as by premiums paid by enrollees aged 65 and older in certain income and wealth categories.
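
The post-processing in the cell above can be wrapped in one function. This is a sketch, not the notebook's exact behavior: `extract_answer` is a hypothetical helper that also trims surrounding whitespace and strips `</s>` when no `[/INST]` tag is present.

```python
def extract_answer(decoded, inst_close="[/INST]", eos="</s>"):
    """Return the text after the last [/INST] tag, up to the end-of-sequence token."""
    answer = decoded.split(inst_close)[-1] if inst_close in decoded else decoded
    return answer.split(eos)[0].strip()

sample = "<s> [INST] How many parts does Medicare have? [/INST] Medicare is a health insurance program.</s>"
print(extract_answer(sample))  # Medicare is a health insurance program.
```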


---
> Above version is working, ignoring this one for now.

https://huggingface.co/aws-neuron/Mistral-neuron
```
python -m pip install git+https://github.com/aws-neuron/transformers-neuronx.git
```

In [1]:
# using python instead of git clone because I know this supports lfs on the DLAMI image
from huggingface_hub import Repository
repo = Repository(local_dir="Mistral-neuron", clone_from="aws-neuron/Mistral-neuron")

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
/mnt/efs/data/AIEresearch/demo_medicare_handbook/huggingface_demo/Mistral-neuron is already a clone of https://huggingface.co/aws-neuron/Mistral-neuron. Make sure you pull the latest changes with `repo.git_pull()`.
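
The message above points to the git-versus-HTTP guide for `huggingface_hub`. The HTTP route it describes can be sketched with `snapshot_download`. `download_model` is a hypothetical wrapper, and this has not been run in this notebook.

```python
def download_model(repo_id="aws-neuron/Mistral-neuron", local_dir="Mistral-neuron"):
    """Download a Hub repository over HTTP and return the local folder path."""
    # Imported lazily so defining the helper does not require the library
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

# Usage (needs network access):
# path = download_model()
```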


In [2]:
import torch
from transformers_neuronx import constants
from transformers_neuronx.mistral.model import MistralForSampling
from transformers_neuronx.module import save_pretrained_split
from transformers_neuronx.config import NeuronConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

In [3]:
# Set sharding strategy for GQA to be shard over heads
neuron_config = NeuronConfig(
    grouped_query_attention=constants.GQA.SHARD_OVER_HEADS
)

In [4]:
# define the model.  These are the settings used in compilation.
# If you want to change these settings, skip to "Compilation of other Mistral versions"
model_neuron = MistralForSampling.from_pretrained("Mistral-neuron", 
                                                  batch_size=1, 
                                                  tp_degree=2, 
                                                  n_positions=256, 
                                                  amp='bf16', 
                                                  neuron_config=neuron_config)

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [None]:
# load the neff files from the local directory instead of compiling
model_neuron.load("Mistral-neuron")

In [None]:
# load the neff files into the neuron processors.  
# you can see this process happening if you run neuron-top from the command line in another console.
# if you didn't do the previous load command, this will also compile the neff files
model_neuron.to_neuron()

## Llama2

Llama is a decoder model ([source](https://huggingface.co/docs/transformers/model_doc/llama2)), and inference is supported on Neuron ([source](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/model-architecture-fit.html#model-architecture-fit)).

* https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/llama-70b-sampling.ipynb

```
git clone https://github.com/aws-neuron/aws-neuron-samples.git
cd aws-neuron-samples/torch-neuronx/transformers-neuronx/inference
```

In [7]:
# Imports
import os
import time
import torch
from transformers import LlamaForCausalLM, AutoTokenizer
from transformers_neuronx.module import save_pretrained_split
from transformers_neuronx.llama.model import LlamaForSampling

In [3]:
# Construct model
# model = LlamaForCausalLM.from_pretrained('Llama-2-70b')
model = LlamaForCausalLM.from_pretrained('/mnt/efs/data/saved_models/Llama-2-70b-chat-hf/model')

Loading checkpoint shards:   0%|          | 0/29 [00:00<?, ?it/s]

In [6]:
# Save model as state_dict for ingestion
save_pretrained_split(model, '/mnt/efs/data/saved_models/Llama-2-70b-split')

In [8]:
os.environ['NEURON_CC_FLAGS'] = '--enable-mixed-precision-accumulation'
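
Assigning to `NEURON_CC_FLAGS` directly overwrites any flags that were already set. A small sketch that appends instead: `add_cc_flag` is a hypothetical helper, and the only flag it is shown with is the one used above.

```python
import os

def add_cc_flag(flag, env=os.environ):
    """Append a compiler flag to NEURON_CC_FLAGS unless it is already present."""
    flags = env.get("NEURON_CC_FLAGS", "").split()
    if flag not in flags:
        flags.append(flag)
    env["NEURON_CC_FLAGS"] = " ".join(flags)
    return env["NEURON_CC_FLAGS"]

# Demonstrated on a plain dict so the real environment is left untouched
demo_env = {"NEURON_CC_FLAGS": "--verbose=35"}
print(add_cc_flag("--enable-mixed-precision-accumulation", demo_env))
```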

In [9]:
# Load the split Llama-2-70b checkpoint onto the NeuronCores with 24-way tensor parallelism and run compilation
neuron_model = LlamaForSampling.from_pretrained(
    '/mnt/efs/data/saved_models/Llama-2-70b-split',  # Should reference the split checkpoint produced by "save_pretrained_split"
    batch_size=1,           # Batch size must be determined prior to inference time.
    tp_degree=24,           # Controls the number of NeuronCores to execute on. Change to 32 for trn1.32xlarge
    amp='f16',              # This automatically casts the weights to the specified dtype.
)
neuron_model.to_neuron()



2024-01-24 11:43:00.000442:  22035  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-24 11:43:00.000447:  22035  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ec2-user/neuroncc_compile_workdir/9ed376a3-81c0-4189-97c1-a423d72e3f39/model.MODULE_e5ea2e91b07024a4c5a6+ce379c2c.hlo.pb', '--output', '/tmp/ec2-user/neuroncc_compile_workdir/9ed376a3-81c0-4189-97c1-a423d72e3f39/model.MODULE_e5ea2e91b07024a4c5a6+ce379c2c.neff', '--enable-mixed-precision-accumulation', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
2024-01-24 11:43:00.000584:  22046  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-24 11:43:00.000589:  22046  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ec2-user/neuroncc_compile_workdir/5516528c-50e8-4404-9680-83c087363832/model.MODULE_3f7073201620d
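
A back-of-the-envelope estimate shows why a 70B model needs this much tensor parallelism. The sketch below assumes a round 70e9 parameters, 2 bytes per parameter for `f16`, and an even shard across cores. It ignores the KV cache and activations, so real usage will be higher. `weight_gib_per_core` is a hypothetical helper.

```python
def weight_gib_per_core(n_params, bytes_per_param, tp_degree):
    """Approximate weight memory per NeuronCore in GiB, assuming an even shard."""
    return n_params * bytes_per_param / tp_degree / 2**30

# ~70e9 parameters, 2 bytes each for f16, tp_degree=24 as in the cell above
print(round(weight_gib_per_core(70e9, 2, 24), 2))  # about 5.43
```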

In [30]:
# Construct a tokenizer and encode prompt text
# tokenizer = AutoTokenizer.from_pretrained('Llama-2-70b')
# tokenizer = AutoTokenizer.from_pretrained('Llama-2-70b', 
#                                           token='<HF_TOKEN>')
# tokenizer = AutoTokenizer.from_pretrained('upstage/Llama-2-70b-instruct',) 
tokenizer = AutoTokenizer.from_pretrained('/mnt/efs/data/saved_models/Llama-2-70b-chat-hf/tokenizer')

In [31]:
prompt = "How many parts does Medicare have?"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

In [32]:
# Run inference with top-k sampling
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=2048, top_k=50)
    elapsed = time.time() - start

generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')

generated sequences ["<s> How many parts does Medicare have?\nMedicare has four parts: Part A, Part B, Part C, and Part D.\nPart A: Hospital Insurance\nPart A helps cover inpatient care in hospitals, skilled nursing facilities, and hospice care. Most people don't pay a premium for Part A because they or their spouse paid Medicare taxes while working.\nPart B: Medical Insurance\nPart B helps cover outpatient care, such as doctor visits, lab tests, and preventive services. Most people pay a premium for Part B, unless they qualify for financial assistance.\nPart C: Medicare Advantage (MA) Plans\nPart C includes Medicare Advantage (MA) plans, which are offered by private companies approved by Medicare. MA plans combine Part A and Part B benefits, and often include extra benefits like dental and vision coverage. You must have Part A and Part B to join a Medicare Advantage plan.\nPart D: Prescription Drug Coverage\nPart D helps cover the cost of prescription drugs. These plans are offered by
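
Since the goal of the notebook is a speed comparison, the elapsed time is easier to interpret as tokens per second. A sketch under assumptions: `tokens_per_second` is a hypothetical helper, and the token counts would have to be read from the tensors before `generated_sequences` is overwritten with decoded strings.

```python
def tokens_per_second(total_tokens, prompt_tokens, elapsed_seconds):
    """Throughput of newly generated tokens, excluding the prompt."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (total_tokens - prompt_tokens) / elapsed_seconds

# Illustrative numbers only, not measurements from this notebook
print(tokens_per_second(total_tokens=2048, prompt_tokens=8, elapsed_seconds=60.0))  # 34.0
```

In the cell above, the counts would presumably come from `generated_sequences.shape[-1]` and `input_ids.shape[-1]`, assuming `sample` returns a tensor of shape (batch, length). The first call may also include warm-up time.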

In [33]:
outputs = []
for tok in generated_sequences[0].split('\n')[1:]:
    outputs.append(tok)
print("".join(outputs))

Medicare has four parts: Part A, Part B, Part C, and Part D.Part A: Hospital InsurancePart A helps cover inpatient care in hospitals, skilled nursing facilities, and hospice care. Most people don't pay a premium for Part A because they or their spouse paid Medicare taxes while working.Part B: Medical InsurancePart B helps cover outpatient care, such as doctor visits, lab tests, and preventive services. Most people pay a premium for Part B, unless they qualify for financial assistance.Part C: Medicare Advantage (MA) PlansPart C includes Medicare Advantage (MA) plans, which are offered by private companies approved by Medicare. MA plans combine Part A and Part B benefits, and often include extra benefits like dental and vision coverage. You must have Part A and Part B to join a Medicare Advantage plan.Part D: Prescription Drug CoveragePart D helps cover the cost of prescription drugs. These plans are offered by private companies approved by Medicare. You must have Part A or Part B to joi
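
The output above runs lines together ("Part D.Part A: ...") because the pieces are joined with an empty string after splitting on newlines. A minimal sketch that drops the echoed prompt line but keeps the line breaks (`drop_prompt_line` is a hypothetical helper):

```python
def drop_prompt_line(decoded):
    """Drop the first line (the echoed prompt) and keep the remaining line breaks."""
    return "\n".join(decoded.split("\n")[1:])

sample = "<s> How many parts does Medicare have?\nMedicare has four parts.\nPart A: Hospital Insurance"
print(drop_prompt_line(sample))
```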

In [23]:
gs = ["<s> How many parts does Medicare have?\nA: Medicare has four parts:\n\n1. Part A (Hospital Insurance): Covers inpatient hospital stays, skilled nursing care, hospice, and home health care.\n2. Part B (Medical Insurance): Covers outpatient services, such as doctor visits, lab tests, and preventive care.\n3. Part C (Medicare Advantage): Offers Medicare benefits through private insurance companies, often with additional benefits like dental and vision coverage.\n4. Part D (Prescription Drug Coverage): Helps cover the cost of prescription drugs.\n\nIt's worth noting that Medicare Advantage plans (Part C) and Prescription Drug plans (Part D) are offered by private insurance companies, so the specifics of these plans can vary depending on the provider.\n\nI hope that helps! Let me know if you have any other questions.</s>"]
gs = ['<s> How many parts does Medicare have?\n\nAnswer: Medicare has four parts:\n\n1. Part A (Hospital Insurance)\n2. Part B (Medical Insurance)\n3. Part C (Medicare Advantage)\n4. Part D (Prescription Drug Coverage)</s>']
gs = ["<s> How many parts does Medicare have?\n\nOriginal Medicare has two parts: Part A (Hospital Insurance) and Part B (Medical Insurance). Part A generally covers hospital stays, skilled nursing care, hospice, and home health services. Part B covers outpatient services, such as doctor visits, procedures, and preventive care.\n\nHowever, Medicare also offers additional parts or programs, including:\n\n1. Part C (Medicare Advantage): This is a type of Medicare health plan offered by private insurance companies that contract with Medicare. Medicare Advantage plans often include additional benefits, such as dental, vision, and prescription drug coverage.\n2. Part D (Prescription Drug Coverage): This part helps cover the cost of prescription drugs. You can join a Medicare Prescription Drug Plan if you have Part A or Part B.\n3. Medicare Supplement Insurance (Medigap): This is sold by private insurance companies to help cover some of the out-of-pocket costs of Original Medicare, such as deductibles, copayments, and coinsurance.\n4. Medicare Premium Part B.\n\nSo, in summary, there are four main parts of Medicare: Part A, Part B, Part C (Medicare Advantage), and Part D (Prescription Drug Coverage). Additionally, there's Medicare Supplement Insurance (Medigap), which is not a part of Medicare but rather a supplemental insurance sold by private companies to help cover some of the costs of Original Medicare.</s>"]

In [24]:
gs[0].split('\nA: ')[1:]

[]
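
The empty list is expected: `gs` was reassigned three times, and only the first sample contains `'\nA: '`. The other two start the answer with `Answer:` or with no label at all. A sketch of a more tolerant cleanup follows. `strip_prompt` is a hypothetical helper, and the set of labels it handles is taken only from the three samples above.

```python
import re

def strip_prompt(decoded, prompt):
    """Remove <s>/</s>, the echoed prompt, and an optional 'A:' or 'Answer:' label."""
    text = decoded.replace("<s>", "").replace("</s>", "").strip()
    if text.startswith(prompt):
        text = text[len(prompt):]
    text = text.lstrip()
    return re.sub(r"^(?:A|Answer):\s*", "", text)

prompt = "How many parts does Medicare have?"
print(strip_prompt("<s> How many parts does Medicare have?\n\nAnswer: Medicare has four parts:</s>", prompt))
```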

In [22]:
gs[0].split('\nA: ')[1:]

["Medicare has four parts:\n\n1. Part A (Hospital Insurance): Covers inpatient hospital stays, skilled nursing care, hospice, and home health care.\n2. Part B (Medical Insurance): Covers outpatient services, such as doctor visits, lab tests, and preventive care.\n3. Part C (Medicare Advantage): Offers Medicare benefits through private insurance companies, often with additional benefits like dental and vision coverage.\n4. Part D (Prescription Drug Coverage): Helps cover the cost of prescription drugs.\n\nIt's worth noting that Medicare Advantage plans (Part C) and Prescription Drug plans (Part D) are offered by private insurance companies, so the specifics of these plans can vary depending on the provider.\n\nI hope that helps! Let me know if you have any other questions.</s>"]

In [16]:
def cheese():
    # Yield the answer accumulated so far, one piece at a time (mimics streaming)
    outputs = []
    for tok in gs[0].split('\nA: ')[1:]:
        tok = tok.replace('</s>', '')
        outputs.append(tok)
        yield "".join(outputs)

In [19]:
cheese()

<generator object cheese at 0x7f5a481b7220>

In [21]:
for c in cheese():
    print(c)

Medicare has four parts:

1. Part A (Hospital Insurance): Covers inpatient hospital stays, skilled nursing care, hospice, and home health care.
2. Part B (Medical Insurance): Covers outpatient services, such as doctor visits, lab tests, and preventive care.
3. Part C (Medicare Advantage): Offers Medicare benefits through private insurance companies, often with additional benefits like dental and vision coverage.
4. Part D (Prescription Drug Coverage): Helps cover the cost of prescription drugs.

It's worth noting that Medicare Advantage plans (Part C) and Prescription Drug plans (Part D) are offered by private insurance companies, so the specifics of these plans can vary depending on the provider.

I hope that helps! Let me know if you have any other questions.


In [20]:
next(chs)

StopIteration: 
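
Calling a generator function returns a generator object, and `next` on a generator with nothing left to yield raises `StopIteration`. That is the likely cause of the error above, though the cell that created `chs` is not shown, so this is an assumption. A self-contained sketch of the same pattern, with `cumulative` as a hypothetical stand-in for `cheese`:

```python
def cumulative(chunks):
    """Yield the text accumulated so far after each chunk, like a streaming reply."""
    seen = []
    for chunk in chunks:
        seen.append(chunk)
        yield "".join(seen)

gen = cumulative(["Medicare ", "has four parts."])
print(next(gen))        # first partial text
print(next(gen))        # full text
print(next(gen, None))  # None: a default avoids StopIteration once exhausted
```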

In [11]:
outputs = []
for tok in gs[0].split('\nA: ')[1:]:
    tok = tok.replace('</s>', '')
    outputs.append(tok)
    print("".join(outputs))

Medicare has four parts:

1. Part A (Hospital Insurance): Covers inpatient hospital stays, skilled nursing care, hospice, and home health care.
2. Part B (Medical Insurance): Covers outpatient services, such as doctor visits, lab tests, and preventive care.
3. Part C (Medicare Advantage): Offers Medicare benefits through private insurance companies, often with additional benefits like dental and vision coverage.
4. Part D (Prescription Drug Coverage): Helps cover the cost of prescription drugs.

It's worth noting that Medicare Advantage plans (Part C) and Prescription Drug plans (Part D) are offered by private insurance companies, so the specifics of these plans can vary depending on the provider.

I hope that helps! Let me know if you have any other questions.
