<a href="https://colab.research.google.com/github/Sawera557/Openelm-Colab-Testing/blob/main/Testing_of_OpenELM_by_Sawera.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#OpenELM: An Efficient Language Model Family with Open Training and Inference Framework

OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. OpenELM models are pretrained using the CoreNet library. They are released as both pretrained and instruction-tuned models with 270M, 450M, 1.1B, and 3B parameters.
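To make the idea of layer-wise scaling concrete, here is a minimal sketch. It assumes the per-layer scaling factors are linearly interpolated between a minimum and a maximum across the depth of the network, so that later layers get more attention heads and wider feed-forward blocks. The function name, the `LayerShape` type, and all numeric values are hypothetical illustrations, not OpenELM's actual configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayerShape:
    num_heads: int  # attention heads in this layer
    ffn_dim: int    # hidden width of this layer's feed-forward block


def layer_wise_scaling(
    num_layers: int,
    d_model: int,
    head_dim: int,
    alpha_min: float,
    alpha_max: float,
    beta_min: float,
    beta_max: float,
) -> List[LayerShape]:
    """Assumed scheme: interpolate alpha (heads) and beta (FFN width) linearly over depth."""
    shapes = []
    for i in range(num_layers):
        # Relative depth of layer i, from 0.0 (first layer) to 1.0 (last layer).
        t = i / (num_layers - 1) if num_layers > 1 else 0.0
        alpha = alpha_min + (alpha_max - alpha_min) * t
        beta = beta_min + (beta_max - beta_min) * t
        num_heads = max(1, round(alpha * d_model / head_dim))
        ffn_dim = round(beta * d_model)
        shapes.append(LayerShape(num_heads=num_heads, ffn_dim=ffn_dim))
    return shapes


# Hypothetical toy configuration with four layers.
shapes = layer_wise_scaling(
    num_layers=4, d_model=1024, head_dim=64,
    alpha_min=0.5, alpha_max=1.0, beta_min=1.0, beta_max=4.0,
)
for index, shape in enumerate(shapes):
    print(index, shape.num_heads, shape.ffn_dim)
```

In this toy setup the first layer is the smallest and the last layer is the largest, in contrast to a uniform transformer where every layer has the same shape.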

####Install Dependencies

In [None]:
!pip -q install git+https://github.com/huggingface/transformers --progress-bar off
!pip install -q datasets loralib sentencepiece --progress-bar off
!pip -q install bitsandbytes accelerate xformers einops --progress-bar off

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


####Login to Hugging Face

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: read).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your term

####Initialize Tokenizer and Model

Make sure you are authenticated to use the Llama-2-7b-hf tokenizer from Hugging Face. The Llama 2 repositories are gated, so request access first.

From this link: https://huggingface.co/meta-llama/Llama-2-13b-hf
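As an alternative to the interactive prompt, the token can be supplied from an environment variable. The sketch below assumes the token is stored in a variable named `HF_TOKEN` and that the `huggingface_hub` package is installed; the helper names `resolve_hf_token` and `login_if_token_available` are hypothetical.

```python
import os
from typing import Optional


def resolve_hf_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored in the given environment variable, or None if unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def login_if_token_available(env_var: str = "HF_TOKEN") -> bool:
    """Log in to the Hugging Face Hub when a token is available; report whether login ran."""
    token = resolve_hf_token(env_var)
    if token is None:
        return False
    # Imported lazily so the helper above can be used without the package installed.
    from huggingface_hub import login
    login(token=token)
    return True
```

Calling `login_if_token_available()` at the top of the notebook returns `False` when no token is set, in which case the interactive `huggingface-cli login` cell is still needed.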



In [None]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

####Set Configuration and Device

In [None]:

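from transformers import BitsAndBytesConfig

# Use the GPU when available. 4-bit loading with bitsandbytes generally expects a CUDA device.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Computation runs in float16 after the 4-bit weights are dequantized.
compute_dtype = torch.float16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store linear-layer weights in 4 bits
    bnb_4bit_use_double_quant=False,       # do not quantize the quantization constants
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=compute_dtype,
)

model = AutoModelForCausalLM.from_pretrained(
    "apple/OpenELM-1_1B-Instruct",
    device_map=device,
    torch_dtype=torch.float16,
    token=True,                  # reuse the token saved by `huggingface-cli login`
    trust_remote_code=True,      # OpenELM ships its own modeling code on the Hub
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)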



config.json:   0%|          | 0.00/1.60k [00:00<?, ?B/s]

configuration_openelm.py:   0%|          | 0.00/14.3k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/apple/OpenELM-1_1B-Instruct:
- configuration_openelm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_openelm.py:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/apple/OpenELM-1_1B-Instruct:
- modeling_openelm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors:   0%|          | 0.00/2.16G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]
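The effect of the 4-bit configuration on weight memory can be estimated with simple arithmetic. This is a rough sketch that assumes about 1.1 billion parameters, as the model name suggests, and treats every weight as quantized. It ignores quantization constants, layers that stay in higher precision, activations, and the key-value cache, so real usage will be higher. The function name is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3


num_params = 1.1e9  # assumption taken from the model name, not measured

fp16_gib = weight_memory_gib(num_params, 16)
nf4_gib = weight_memory_gib(num_params, 4)

print(f"float16 weights: {fp16_gib:.2f} GiB")
print(f"4-bit weights:   {nf4_gib:.2f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the float16 footprint.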

####Define Prompt Preparation Function





In [None]:
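def prepare_prompt(prompt: str) -> torch.Tensor:
  """Tokenize the prompt and return input ids with shape (1, sequence_length)."""
  tokens = tokenizer(prompt)
  tokenized_prompt = torch.tensor(
      tokens['input_ids'],
      device=device
  )
  # Add the batch dimension expected by model.generate.
  return tokenized_prompt.unsqueeze(0)

def generate(prompt: str, model: AutoModelForCausalLM, max_length: int = 128) -> str:
  """Generate a continuation of the prompt and return the decoded text."""
  tokenized_prompt = prepare_prompt(prompt)
  output_ids = model.generate(
      tokenized_prompt,
      max_length=max_length,  # total length: prompt tokens plus generated tokens
      pad_token_id=0
  )
  # Decode the first (and only) sequence in the batch.
  output_text = tokenizer.decode(
      output_ids[0].tolist(),
      skip_special_tokens=True
  )
  return output_text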
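Because `max_length` counts the prompt tokens as well as the generated ones, a longer prompt leaves less room for the answer, and generation stops mid-sentence once the limit is reached. The small sketch below shows the arithmetic; the helper names and the prompt length of 14 tokens are hypothetical.

```python
def new_token_budget(prompt_tokens: int, max_length: int) -> int:
    """Tokens left for generation when max_length caps prompt plus output."""
    return max(0, max_length - prompt_tokens)


def max_length_for(prompt_tokens: int, max_new_tokens: int) -> int:
    """Total length needed to allow a given number of newly generated tokens."""
    return prompt_tokens + max_new_tokens


prompt_tokens = 14  # hypothetical prompt length
print(new_token_budget(prompt_tokens, max_length=300))
print(max_length_for(prompt_tokens, max_new_tokens=300))
```

In `transformers`, passing `max_new_tokens` to `model.generate` instead of `max_length` sets the budget for generated tokens directly.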

####Generate Text for Prompt 1

In [None]:
%%time
prompt = "Write names of the 15 large language models LLMs \n"
print(generate(prompt, model, 300))

Write names of the 15 large language models LLMs 

1. Alberta (Google)
2. BERT (Bidirectional Encoder Representations from Transformers) (Lemon et al., 2018)
3. BLEU (Bleu, 1998)
4. BERT-Base (Devlin et al., 2018)
5. BERT-Large (Devlin et al., 2018)
6. BERT-Multilingual (Devlin et al., 2018)
7. BERT-Tiny (Devlin et al., 2018)
8. BERT-WS (Devlin et al., 2018)
9. BERT-XL (Devlin et al., 2018)
10. BERT-C (Chu et al., 2019)
11. BERT-CZ (Chu et al., 2019)
12. BERT-Multilingual-C (Chu et al., 2019)
13. BERT-Multilingual-CZ (Chu et al., 2019)
14. BERT-Large-C (Chu et al., 2
CPU times: user 23.1 s, sys: 787 ms, total: 23.9 s
Wall time: 29.2 s


####Generate Text for Prompt 2

In [None]:
%%time
prompt = "Generate python code to transfer dataframe to Google big query pandas library\n"
generated_text = generate(prompt, model, 500)
print(generated_text)

Generate python code to transfer dataframe to Google big query pandas library

I am trying to write a python function to transfer a pandas DataFrame to Google BigQuery.

I have the following code:

```python
import pandas as pd
import gbq_client
import os

def gbq_transfer_to_bigquery(path_to_data, table_name, df):
    gbq_client.initialize()
    gbq_service = gbq_client.BigQueryService(project_id=path_to_data.split("-")[-1])

    df_gbq = df.togbq(gbq_service, table_name=table_name)

    try:
        gbq_service.upload_data(path_to_data, df_gbq)
    except Exception as e:
        print(f"Error uploading to BigQuery: {e}")

    gbq_service.close()


def transfer_to_bigquery(path_to_data, table_name, df):
    df_gbq = gbq_transfer_to_bigquery(path_to_data, table_name, df)
    df_gbq.to_sql(f"{path_to_data}_bigquery", con=path_to_data.split("-")[-2] + "_sqlserver", if_exists="replace", index=False)

    return df_gbq


def main():
    path_to_data = "path/to/input_data"
    table_name = 
```

####Generate Text for Prompt 3

In [None]:
%%time
prompt = 'Name few Text to image open source AI models \n'
generated_text = generate(prompt, model, 300)
print(generated_text)

Name few Text to image open source AI models 

1. [Tesseract](https://tesseract-ocr.org/) - Open-source OCR (Optical Character Recognition) engine developed by Google and Microsoft. Tesseract is a free and open-source OCR engine developed by Google and is available under the GNU General Public License (GPL). Tesseract is designed to work with various input formats, including scanned images, handwritten text, and barcodes.

2. [OpenCV-Text](https://github.com/opencv-contrib/opencv-text) - OpenCV Text is a collection of OpenCV modules designed to process images and text data. The OpenCV Text repository provides various text processing algorithms, including OCR (Optical Character Recognition), text detection, and text segmentation. The project is developed by the OpenCV (Open Source Computer Vision) project, which is a collaboration between Google, Microsoft, and Intel.

3. [TensorFlow Text](https://github.com/google/tf-text) - TensorFlow Text is a collection of TensorFlow libraries desig