# Text Generation using OpenAI LLMs (GPT 3.5/GPT4)

<img src="../Assets/Images/LLM LS.png" width=1000>

---

- Installation & Importing Libraries
- Accessing OpenAI (Keys) & Setting up Client

---
- Chat Completion API (Basic Text Generation)
- Tokens
- Embeddings
- RAG
- Fine Tuning



- Time (75 minutes coding + 15 minutes QnA) 
- We'll also take questions at the end of each section
- Aim is to explore OpenAI features

## Installation & Importing Libraries

In [1]:
%pip install --upgrade pip --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install -r ../requirements.txt --quiet

Note: you may need to restart the kernel to use updated packages.


In [3]:
import openai #OpenAI python library
from openai import OpenAI #OpenAI Client
from configparser import ConfigParser #library to read the config file

import tiktoken #library to count tokens

import gradio as gr #library for gradio interface

from sklearn.metrics.pairwise import cosine_similarity #for calculating similarities between embeddings

from bs4 import BeautifulSoup #for extracting text from webpages
import requests

import PyPDF2 #for reading text from pdf 

from langchain.document_loaders import TextLoader #to load text for RAG
from langchain.text_splitter import RecursiveCharacterTextSplitter #to chunk data for RAG
from langchain.embeddings.openai import OpenAIEmbeddings #to create embeddings for RAG
from langchain.vectorstores import FAISS # to store embeddings in a vector index
from sklearn.model_selection import train_test_split #for model finetuning

import json #for creating finetuning files

import numpy as np
from sklearn.cluster import KMeans # for clustering of text using embeddings

import pandas as pd

import warnings
warnings.filterwarnings('ignore')



## 
___

## Accessing OpenAI (Keys) & Setting up Client

__Question__ : Have you procured your OpenAI API key?

If not,  create one here - https://platform.openai.com/api-keys

Also, check if you have sufficient balance - https://platform.openai.com/account/billing/overview

__Step 1__ : Read and set the OpenAI API key in the environment

In [4]:
config_object = ConfigParser()
config_object.read("../config.ini")
openai.api_key = config_object["OPENAI"]["openai_key"] #read the api key from the config file

In [None]:
#openai.api_key="<Your API Key>"

There are several ways of storing the API key in the environment. Choose one according to your preference or your organisation's policy.
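One common option is an environment variable. The sketch below is only an illustration, not the workshop's method: `load_api_key` is a hypothetical helper that prefers the `OPENAI_API_KEY` environment variable (the name the OpenAI Python client looks up by default) and falls back to the `config.ini` layout used above. The key in the demonstration is a placeholder.

```python
import os
from configparser import ConfigParser

def load_api_key(env=None, config_path="../config.ini"):
    """Return the API key from the environment, else from the config file, else None."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    if key:
        return key
    config = ConfigParser()
    config.read(config_path)  # a missing file is skipped without raising
    if config.has_option("OPENAI", "openai_key"):
        return config["OPENAI"]["openai_key"]
    return None

# Demonstration with a stand-in environment (placeholder key, not a real one)
demo_key = load_api_key(env={"OPENAI_API_KEY": "sk-placeholder"})
print(demo_key)
```

In the notebook, the returned value could be passed as `OpenAI(api_key=...)` in Step 2.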

__Step 2__: Initialize the OpenAI client. This serves as an interface to interact with OpenAI's services and APIs.

In [5]:
client = OpenAI(api_key=openai.api_key)

And we're ready! Let's try and make the first call!

In [6]:
response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "user", "content": "Hello!"}
  ]
)
print(response.choices[0].message.content)

Hello! How can I assist you today?


## 
---

## Chat Completion API (Basic Text Generation)

### Introduction

<span style="font-size: 20px; color: orange">>><b>Generative AI, and LLMs specifically, is a General Purpose Technology that is useful for a variety of applications</b></span>

<span style="font-size: 16px;"><i>"LLMs can be, generally, thought of as a next word prediction model"</i></span>

<span style="font-size: 16px; color: blue"><b>What is an LLM?</b></span>

- LLMs are __machine learning models__ that have learned from __massive datasets__ of human-generated content, finding statistical patterns to replicate human-like abilities.

- __Foundation models__, also known as base models, have been trained on trillions of words for weeks or months using extensive compute power. These models have __billions of parameters__, which represent their memory and enable sophisticated tasks.

- __Interacting with LLMs differs from traditional programming paradigms. Instead of formalized code syntax, you provide natural language prompts to the models__.

- When you pass a __prompt__ to the model, it predicts the next words and generates a __completion__. This process is known as __inference__.


<span style="font-size: 16px; color: blue"> <b>Prompts, Completions and Inference!</b></span>

<img src="../Assets/Images/LLM Inference.png" width=800>

<span style="font-size: 16px; color: blue"> <b>Available OpenAI models</b></span>

---

__GPT 4__

__<u>(Production)</u>__

| Name | Context Window | Cut-off date | Snapshot |
|---|---|---|---|
| __gpt-4__ | 8,192 tokens | Up to Sep 2021 | gpt-4-0613 |
| __gpt-4-32k__ | 32,768 tokens | Up to Sep 2021 | gpt-4-32k-0613 |

__<u>(Preview)</u>__

| Name | Context Window | Cut-off date | Snapshot |
|---|---|---|---|
| __gpt-4-turbo-preview__ | 128,000 tokens | Up to Dec 2023 | gpt-4-0125-preview |
| __gpt-4-vision-preview__ | 128,000 tokens | Up to Apr 2023 | gpt-4-1106-vision-preview |

---

__GPT 3.5__

| Name | Context Window | Cut-off date | Snapshot |
|---|---|---|---|
| __gpt-3.5-turbo__ | 16,385 tokens | Up to Sep 2021 | gpt-3.5-turbo-0125 |
| __gpt-3.5-turbo-instruct__ | 4,096 tokens | Up to Sep 2021 | |

---

For more details, visit the official documentation -> https://platform.openai.com/docs/models

---

<span style="font-size: 14px; color: orange">__IMP__ : __"model"__ is passed as a parameter in the chat completions API</span>

---





<span style="font-size: 16px; color: blue"> <b>OpenAI Chat Messages (Prompt) Structure</b></span>

<span style="font-size: 16px;"><u>Role</u></span> : OpenAI allows for *three* roles/personas - 
1. __System__ : The overarching constraints/definitions/instructions of the system that the LLM should "remember"
2. __User__ : Any instruction a user wants to pass to the LLM
3. __Assistant__ : The response from the LLM

<span style="font-size: 16px;"><u>Content</u></span> : The message or "prompt" of each of these personas is passed as "content"



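__Why is this important?__ : The API does not remember earlier calls. Passing the previous user and assistant messages back in "messages" is how conversation history is carried from one call to the next.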
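As a minimal sketch of how the history can be kept, the snippet below builds the `messages` list turn by turn without calling the API. `start_conversation` and `add_turn` are hypothetical helper names, and the assistant reply is typed in by hand where a live session would use `response.choices[0].message.content`.

```python
ALLOWED_ROLES = ("system", "user", "assistant")  # the three roles described above

def start_conversation(system_prompt):
    """Create a new history that begins with the system message."""
    return [{"role": "system", "content": system_prompt}]

def add_turn(history, role, content):
    """Append one message to the history and return the history."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unknown role: {role}")
    history.append({"role": role, "content": content})
    return history

history = start_conversation("You are a helpful assistant knowledgeable in the field of Cricket.")
add_turn(history, "user", "When did Australia win their first Cricket World Cup?")
# In a live session this text would come from the previous API response
add_turn(history, "assistant", "Australia won their first Cricket World Cup in 1987.")
add_turn(history, "user", "How much did they score?")

for message in history:
    print(message["role"], ":", message["content"])
```

The resulting list has the shape expected by the `messages` parameter, so the follow-up question is sent together with everything that came before it.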

---

<span style="font-size: 14px; color: orange">__IMP__ : __"role"__ and __"content"__ are passed as a dictionary in the __"messages"__ parameter in the chat completions API </span>

---

__Let's try__!

In [7]:
response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant knowledgeable in the field of Cricket."},
    {"role": "user", "content": "When did Australia win their first Cricket World Cup?"}
  ]
)

In [8]:
print(response.choices[0].message.content)

Australia won their first Cricket World Cup in 1987. They defeated England in the final held at Eden Gardens, Kolkata, India.


In [9]:
response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant knowledgeable in the field of Cricket."},
    {"role": "user", "content": "When did Australia win their first Cricket World Cup?"},
    {"role": "assistant", "content": "Australia won their first Cricket World Cup in the year 1987. They defeated England in the final to clinch their maiden title in the tournament."},
    {"role": "user", "content": "How much did they score?"}
  ]
)

from pprint import pprint

pprint(response.choices[0].message.content)

('In the final of the 1987 Cricket World Cup, Australia scored 253 runs for '
 'the loss of 5 wickets in their 50 overs. England, in response, managed to '
 'score 246 runs for the loss of 8 wickets, resulting in Australia winning by '
 '7 runs.')


### Chat API Parameters

<span style="font-size: 20px; color: orange">>><b>"model"</b> and <b>"messages"</b> are the two required API parameters</span>

<span style="font-size: 16px; color: blue"> There are several other optional parameters that help configure the response</span>

---

__n__ : Number of responses you want the LLM to generate for the instruction

__max_tokens__ : Maximum number of tokens the model may generate in the completion (the prompt and the completion together must still fit within the model's context window)

__temperature__ : Temperature controls the "randomness" of the responses. Higher value increases the randomness; lower value makes the output more deterministic (value between 0 and 2)

__top_p__ : Nucleus sampling. The model samples only from the smallest set of tokens whose cumulative probability reaches top_p (value between 0 and 1)


<img src="../Assets/Images/Temperature - Top P.png" width=600>

<span style="font-size: 14px; color: orange"> __IMP__ : It is recommended to configure either one of "temperature" and "top_p" but not both</span>


__frequency_penalty__ : Penalize new tokens based on their existing frequency in the text so far *(Value between -2 and 2)*

__presence_penalty__ : Penalize new tokens based on whether they appear in the text so far *(Value between -2 and 2)*

__logprobs__ : Flag to return log probability of the generated tokens *(True/False)*

__logit_bias__ : Parameter to control the presence of particular tokens in the output *(Value between -100 and 100)*

__response_format__ : Response of the model can be requested in a particular format *(Currently : JSON and Text)*

__seed__ : Beta feature for reproducible outputs (setting a seed value may produce the same output repeatedly)

__stop__ : End of Sequence tokens that will stop the generation

__stream__ : To receive partial message deltas *(True/False)*

__user__ : ID representing end user (This helps OpenAI detect abuse. May be mandatory for higher rate limits)

__tools__ : used in function calling

__tool_choice__ : used in function calling

---
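To make "temperature" and "top_p" more concrete, here is a toy sketch of the two ideas on made-up scores. It is an illustration of the general technique, not OpenAI's implementation: the tokens and logits are hypothetical, and the sketch requires a temperature greater than 0.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their combined probability reaches top_p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

# Hypothetical scores for four candidate next tokens
tokens = ["cat", "dog", "car", "sky"]
logits = [2.0, 1.0, 0.5, 0.1]

cold = softmax_with_temperature(logits, 0.2)  # low temperature
hot = softmax_with_temperature(logits, 2.0)   # high temperature
nucleus = top_p_filter(softmax_with_temperature(logits, 1.0), 0.8)
kept_tokens = [tokens[i] for i in nucleus]

print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
print(kept_tokens)
```

With these scores the low temperature concentrates almost all probability on the first token, the high temperature spreads it out, and top_p drops the least likely token before sampling.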

In [10]:
def gpt_call(model:str="gpt-3.5-turbo",prompt:str="Have I provided any input",n:int=1,max_tokens:int=100,temperature:float=0.5,presence_penalty:float=0):
    #call the chat completions API and return all n responses as a single string
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        max_tokens=int(max_tokens), #UI sliders may pass floats; the API expects an integer
        temperature=temperature,
        presence_penalty=presence_penalty,
        n=int(n)
    )

    output=''

    for i in response.choices: #one choice per requested response
        output+=str(i.message.content)+'\n------------\n'

    return output


    


In [12]:
print(gpt_call(prompt="Write a title for a workshop on openai API",n=2,temperature=0))

"Unlocking the Power of OpenAI: A Hands-On Workshop on Harnessing the OpenAI API"
------------
"Unlocking the Power of OpenAI: A Hands-On Workshop on Harnessing the OpenAI API"
------------



In [13]:

def text_to_uppercase(model,n,max_tokens,temperature,presence_penalty,prompt):
    return str(type(n)) + str(type(temperature))

model=gr.Radio(["gpt-4","gpt-3.5-turbo"], label="Select Model")
n=gr.Radio([1,2,3], label="Number of Responses")
max_tokens=gr.Slider(minimum=10, maximum=500, label="Maximum Tokens")
temperature=gr.Slider(minimum=0.0, maximum=1.0, label="Temperature")
prompt=gr.Text(label="Prompt")


iface = gr.Interface(fn=gpt_call, inputs=[model,prompt,n,max_tokens,temperature], outputs="text")
iface.launch()


Running on local URL:  http://127.0.0.1:7882

To create a public link, set `share=True` in `launch()`.




In [14]:
iface.close()

Closing server running on port: 7882


### Decoding the Response Object

In [15]:
response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant knowledgeable in the field of Cricket."},
    {"role": "user", "content": "When did Australia win their first Cricket World Cup?"},
    {"role": "assistant", "content": "Australia won their first Cricket World Cup in the year 1987. They defeated England in the final to clinch their maiden title in the tournament."},
    {"role": "user", "content": "How much did they score?"}
  ]
)

In [16]:
print((response.model_dump_json(indent=5)))


{
     "id": "chatcmpl-953XSZ2VK1ePHHUnKC0hLsKRtjxiu",
     "choices": [
          {
               "finish_reason": "stop",
               "index": 0,
               "logprobs": null,
               "message": {
                    "content": "In the final of the 1987 Cricket World Cup, Australia scored 253 runs for the loss of 5 wickets in their allotted 50 overs. England, in response, could only manage 246 runs, falling short by 7 runs.",
                    "role": "assistant",
                    "function_call": null,
                    "tool_calls": null
               }
          }
     ],
     "created": 1710992270,
     "model": "gpt-3.5-turbo-0125",
     "object": "chat.completion",
     "system_fingerprint": "fp_4f0b692a78",
     "usage": {
          "completion_tokens": 51,
          "prompt_tokens": 77,
          "total_tokens": 128
     }
}


<img src="../Assets/Images/Chat Completion Object.png" width=1000>
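The fields most often needed from this object are the generated text, the finish reason and the token usage. The sketch below reads them from a trimmed copy of the JSON printed above, so it runs without an API call. On the live object the same values are available as attributes, for example `response.usage.total_tokens`.

```python
import json

# Trimmed copy of the response printed above
raw = """
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "In the final of the 1987 Cricket World Cup, Australia scored 253 runs for the loss of 5 wickets in their allotted 50 overs. England, in response, could only manage 246 runs, falling short by 7 runs.",
        "role": "assistant"
      }
    }
  ],
  "model": "gpt-3.5-turbo-0125",
  "usage": {"completion_tokens": 51, "prompt_tokens": 77, "total_tokens": 128}
}
"""

data = json.loads(raw)
first_choice = data["choices"][0]
answer = first_choice["message"]["content"]
# "stop" means the model finished on its own; "length" would mean the token limit cut it off
stopped_naturally = first_choice["finish_reason"] == "stop"
usage = data["usage"]

print(answer)
print("finished naturally:", stopped_naturally)
print("tokens used:", usage["prompt_tokens"], "+", usage["completion_tokens"], "=", usage["total_tokens"])
```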


### Streaming

In [17]:
response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant knowledgeable in the field of Cricket."},
    {"role": "user", "content": "When did Australia win their first Cricket World Cup?"},
    {"role": "assistant", "content": "Australia won their first Cricket World Cup in the year 1987. They defeated England in the final to clinch their maiden title in the tournament."},
    {"role": "user", "content": "How much did they score?"}
  ],
  stream=True
)

for chunk in response:
  print(chunk.choices[0].delta.content)



In
 the
 final
 of
 the
 
198
7
 Cricket
 World
 Cup
,
 Australia
 scored
 
253
 runs
 for
 the
 loss
 of
 
5
 w
ickets
 in
 their
 
50
 overs
.
 England
,
 in
 response
,
 managed
 to
 score
 
246
 runs
,
 falling
 short
 by
 
7
 runs
,
 thus
 Australia
 winning
 the
 match
.
None
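Each chunk carries a small piece of text, and the last one carries `None`, which is why the output above ends with "None". A minimal sketch of stitching the pieces back together is shown below. `collect_stream` and `fake_chunk` are hypothetical helpers; the fake chunks only imitate the attribute path `choices[0].delta.content` so that the example runs offline.

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Join the text deltas of a streamed chat completion, skipping empty ones."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk carries None instead of text
            parts.append(delta)
    return "".join(parts)

def fake_chunk(text):
    """Stand-in object with the same attribute path as a real streamed chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

# Simulated stream that mirrors the first few deltas printed above
stream = [fake_chunk(t) for t in ["In", " the", " final", " of", " the", " ", "198", "7", None]]
text = collect_stream(stream)
print(text)
```

To show the text as it arrives instead of collecting it, each non-empty delta could be printed with `print(delta, end="", flush=True)`.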


In [18]:
response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "user", "content": "Write an email requesting for a leave of absence"}
  ],
  stream=True
)

for chunk in response:
  print(chunk.choices[0].delta.content)


Subject
:
 Request
 for
 Leave
 of
 Abs
ence



Dear
 [
Manager
's
 Name
],


I
 am
 writing
 to
 request
 a
 leave
 of
 absence
 from
 work
 from
 [
start
 date
]
 to
 [
end
 date
].
 The
 reason
 for
 my
 request
 is
 [
state
 reason
,
 such
 as
 a
 family
 emergency
,
 medical
 issue
,
 personal
 matter
,
 etc
.]
.


I
 have
 already
 spoken
 to
 my
 team
 members
 and
 arranged
 for
 coverage
 of
 my
 responsibilities
 during
 my
 absence
.
 I
 will
 ensure
 that
 all
 pending
 tasks
 are
 completed
 before
 I
 leave
 and
 provide
 detailed
 instructions
 for
 my
 colleagues
 to
 follow
 in
 my
 absence
.


I
 understand
 that
 my
 absence
 may
 create
 inconvenience
 for
 the
 team
,
 and
 I
 apologize
 for
 any
 disruption
 this
 may
 cause
.
 I
 assure
 you
 that
 I
 will
 make
 every
 effort
 to
 minimize
 any
 impact
 on
 our
 projects
 and
 ensure
 a
 smooth
 transition
 during
 my
 absence
.


I
 kindly
 request
 your
 approval
 for
 this
 leave
 of
 absence
 and
 would
 ap

### JSON

In [21]:
prompt="generate the entire text for a blog on a cricket match. \"Title\" is a catchy and attractive title of the blog. The \"Heading\" is the heading for each point in the blog and the \"Body\" is the text for that heading.Output in a json structure"

In [22]:
response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
        {"role": "user", "content": prompt}
  ],
  response_format={ "type": "json_object" }
)

print(response.choices[0].message.content)

{
  "Title": "Thrilling Cricket Match: A Nail-biting Encounter Between Rivals",
  "Heading1": "The Buildup",
  "Body1": "The anticipation was palpable as the date of the match approached. Both teams were in top form and fans were eagerly waiting for the clash between the two fierce rivals.",
  "Heading2": "The Toss",
  "Body2": "The toss was won by Team A, who elected to bat first. It was a crucial decision as the pitch was expected to deteriorate later in the day.",
  "Heading3": "Team A's Innings",
  "Body3": "Team A got off to a solid start, with their top order batsmen scoring freely. However, they suffered a collapse in the middle overs, with Team B's bowlers wreaking havoc. They managed to post a competitive total of 250 runs.",
  "Heading4": "Team B's Chase",
  "Body4": "Team B's chase got off to a shaky start, losing early wickets. However, their middle order batsmen steadied the ship and kept them in the game. The match went down to the wire, with Team B needing 10 runs off th

## 

---

## Tokens

<span style="font-size: 20px; color: orange">>><b>Tokens are the fundamental units of NLP</b></span>

<span style="font-size: 16px; color: blue"><b>These units are typically words, punctuation marks, or other meaningful substrings that make up the text</b></span>

Counting the number of tokens is important because - 
- The number of tokens determines the amount of computation required and hence the cost you incur
- The context window, i.e. the maximum number of tokens an LLM can process in one go, is limited


### Counting Tokens

In [23]:
####num_tokens_from_string: count the number of tokens in a text string using tiktoken
####parameters: "string" is the text string, "encoding_name" is the encoding name to be used by tiktoken
####(cl100k_base is the encoding used by the gpt-3.5-turbo and gpt-4 models)
####returns: num_tokens->number of tokens in the text string
def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
    encoding = tiktoken.get_encoding(encoding_name) #### Initialize encoding ####
    return len(encoding.encode(string)) #### Return number of tokens in the text string ####

In [24]:
num_tokens_from_string("Hello how are you?")

5

In [25]:
with open("../Assets/Data/alice_in_wonderland.txt") as f:
    AliceInWonderland = f.read()

num_tokens_from_string(AliceInWonderland)

38680
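With the token count in hand, it is easy to check which models could take the whole book in a single prompt. The sketch below uses the context windows listed in the model table earlier in this notebook. `fits_in_context` is a hypothetical helper, and the 500 tokens reserved for the completion are an arbitrary assumption.

```python
# Context windows (in tokens) as listed in the model table earlier in this notebook
CONTEXT_WINDOWS = {
    "gpt-4": 8192,
    "gpt-4-32k": 32768,
    "gpt-4-turbo-preview": 128000,
    "gpt-3.5-turbo": 16385,
}

def fits_in_context(prompt_tokens, model, reserved_for_completion=500):
    """True if the prompt plus a reserved completion budget fits in the model's window."""
    return prompt_tokens + reserved_for_completion <= CONTEXT_WINDOWS[model]

alice_tokens = 38680  # token count computed above
fit_report = {model: fits_in_context(alice_tokens, model) for model in CONTEXT_WINDOWS}

for model, fits in fit_report.items():
    print(model, "->", "fits" if fits else "too long")
```

Texts that do not fit have to be shortened or split into chunks before they are sent to the model.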

### Pricing



| Model | Prompt | Response |
|---|---|---|
| __gpt-3.5-turbo-0125__ | $0.50 / 1M tokens | $1.50 / 1M tokens |
| __gpt-4__ | $30.00 / 1M tokens | $60.00 / 1M tokens |
| __gpt-4-turbo__ | $10.00 / 1M tokens | $30.00 / 1M tokens |
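Token counts and prices together give a quick cost estimate. The sketch below copies the prices listed above, which may have changed since this notebook was written. `estimate_cost` is a hypothetical helper, and the token counts come from earlier cells (38,680 tokens for the book, 77 prompt and 51 completion tokens for the cricket follow-up).

```python
# USD per 1M tokens, copied from the pricing listed above (prices change over time)
PRICING = {
    "gpt-3.5-turbo-0125": {"prompt": 0.50, "response": 1.50},
    "gpt-4": {"prompt": 30.00, "response": 60.00},
    "gpt-4-turbo": {"prompt": 10.00, "response": 30.00},
}

def estimate_cost(model, prompt_tokens, completion_tokens=0):
    """Estimated cost in USD for one call with the given token counts."""
    rates = PRICING[model]
    return (prompt_tokens * rates["prompt"] + completion_tokens * rates["response"]) / 1_000_000

alice_cost_gpt35 = estimate_cost("gpt-3.5-turbo-0125", 38680)    # whole book as a prompt
alice_cost_gpt4 = estimate_cost("gpt-4", 38680)                  # same prompt, priced as gpt-4
cricket_call_cost = estimate_cost("gpt-3.5-turbo-0125", 77, 51)  # usage from the response object

print(f"{alice_cost_gpt35:.5f} {alice_cost_gpt4:.4f} {cricket_call_cost:.6f}")
```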

## 

---

## Embeddings

### Introduction

<span style="font-size: 20px; color: orange">>><b>Embeddings are vector representations of data that capture meaningful relationships between entities</b></span>


- All Machine Learning/AI models work with numerical data. Before any operation can be performed, text/image/audio/video data has to be transformed into a numerical representation.

- As a general definition, embeddings are data that has been transformed into n-dimensional vectors (stacked into matrices) for use in deep learning computations.

<img src="../Assets/Images/Embeddings.png" width=800>

<span style="font-size: 16px; color: blue"><b>OpenAI Embeddings</b></span>

| Model | Price |
|---|---|
| __text-embedding-3-small__ | $0.02 / 1M tokens |
| __text-embedding-3-large__ | $0.13 / 1M tokens |
| __ada v2__ | $0.10 / 1M tokens |

In [26]:

embeddings=client.embeddings.create(
  model="text-embedding-3-small",
  input="The food was delicious",
  encoding_format="float",
)


In [27]:
print(embeddings.model_dump_json(indent=4))

{
    "data": [
        {
            "embedding": [
                -0.019819789,
                -0.021811483,
                -0.06169395,
                -0.038838044,
                0.011288293,
                -0.032474335,
                -0.007814972,
                0.070437975,
                -0.008889758,
                -0.04471597,
                0.020682048,
                -0.030701242,
                0.005167476,
                -0.027980879,
                -0.009915967,
                -0.009472693,
                0.018993964,
                -0.021738617,
                -0.017390894,
                0.023438843,
                0.053872906,
                -0.0061329617,
                -0.023001643,
                -0.00004990622,
                -0.0068859193,
                0.03733213,
                -0.00096320896,
                -0.0014429159,
                -0.009928111,
                0.0076753106,
                0.017220872,
                -0.011

In [28]:
embeddings=client.embeddings.create(
  model="text-embedding-3-small",
  input=["The food was delicious","The ambience was nice","The service was ordinary"],
  encoding_format="float",
  dimensions=10
)

In [30]:
print(embeddings.model_dump_json(indent=4))

{
    "data": [
        {
            "embedding": [
                -0.16478635,
                -0.18134578,
                -0.5129379,
                -0.32290855,
                0.0938535,
                -0.2699992,
                -0.0649755,
                0.5856378,
                -0.073911525,
                -0.37177902
            ],
            "index": 0,
            "object": "embedding"
        },
        {
            "embedding": [
                -0.029102806,
                -0.34336767,
                -0.66657615,
                -0.37806797,
                0.08133914,
                -0.4815079,
                -0.12963039,
                0.060725518,
                -0.060477655,
                -0.17713675
            ],
            "index": 1,
            "object": "embedding"
        },
        {
            "embedding": [
                -0.27565208,
                0.1992017,
                -0.5211547,
                -0.37973946,
                -0.0
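The 10-dimensional vector for "The food was delicious" appears to be the first ten values of the full-size embedding printed earlier, rescaled to unit length. The sketch below checks this with the numbers shown in this notebook. `shorten_embedding` is a hypothetical helper. Truncating and re-normalising is also the manual procedure the OpenAI embeddings guide describes, but treat the link to the `dimensions` parameter as an observation from this output rather than a guarantee.

```python
import math

def shorten_embedding(vector, dimensions):
    """Keep the first `dimensions` values and rescale the result to unit length."""
    cut = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in cut))  # assumes the cut vector is not all zeros
    return [x / norm for x in cut]

# First ten values of the full-size embedding of "The food was delicious" printed earlier
full_prefix = [-0.019819789, -0.021811483, -0.06169395, -0.038838044, 0.011288293,
               -0.032474335, -0.007814972, 0.070437975, -0.008889758, -0.04471597]

short = shorten_embedding(full_prefix, 10)
print([round(x, 4) for x in short])
```

The printed values can be compared with the first embedding in the 10-dimensional output above.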

### Text Search

In [31]:
embeddings_q=client.embeddings.create(
  model="text-embedding-3-small",
  input="food",
  encoding_format="float",
  dimensions=10
)

In [32]:
query=embeddings_q.data[0].embedding

In [33]:
query

[-0.012315562,
 -0.16567738,
 -0.069231726,
 -0.10728083,
 0.4480219,
 -0.61836094,
 0.29532152,
 0.4434862,
 -0.24505134,
 -0.17046502]

In [34]:
d1=embeddings.data[0].embedding #"The food was delicious"
d2=embeddings.data[1].embedding #"The ambience was nice"
d3=embeddings.data[2].embedding #"The service was ordinary"

<img src="https://storage.googleapis.com/lds-media/images/cosine-similarity-vectors.original.jpg">

In [35]:
cosine_similarity([query],[d1])

array([[0.6332542]])

In [36]:
cosine_similarity([query],[d2])

array([[0.51180575]])

In [37]:
cosine_similarity([query],[d3])

array([[0.28721449]])
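Cosine similarity is the dot product of two vectors divided by the product of their lengths. As a sketch of what `cosine_similarity` computes, the snippet below applies that formula in plain Python to the query vector and the "The food was delicious" vector printed above. `cosine` is a hypothetical helper name, and the result should land close to the 0.633 reported by scikit-learn.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Embedding of the query "food" (10 dimensions), copied from the output above
query = [-0.012315562, -0.16567738, -0.069231726, -0.10728083, 0.4480219,
         -0.61836094, 0.29532152, 0.4434862, -0.24505134, -0.17046502]

# Embedding of "The food was delicious" (10 dimensions), copied from the output above
d1 = [-0.16478635, -0.18134578, -0.5129379, -0.32290855, 0.0938535,
      -0.2699992, -0.0649755, 0.5856378, -0.073911525, -0.37177902]

similarity = cosine(query, d1)
print(round(similarity, 4))
```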

__1300+ Towards DataScience Medium Articles Dataset__

Data Source - https://www.kaggle.com/datasets/meruvulikith/1300-towards-datascience-medium-articles-dataset

In [38]:
data=pd.read_csv("../Assets/Data/medium.csv")

In [39]:
data.head()

Unnamed: 0,Title,Text
0,A Beginner’s Guide to Word Embedding with Gens...,1. Introduction of Word2vec\n\nWord2vec is one...
1,Hands-on Graph Neural Networks with PyTorch & ...,"In my last article, I introduced the concept o..."
2,How to Use ggplot2 in Python,Introduction\n\nThanks to its strict implement...
3,Databricks: How to Save Data Frames as CSV Fil...,Photo credit to Mika Baumeister from Unsplash\...
4,A Step-by-Step Implementation of Gradient Desc...,A Step-by-Step Implementation of Gradient Desc...


In [40]:
data.shape

(1391, 2)

In [41]:
trunc_data=data.iloc[0:100,:].copy() #copy so that new columns can be added without a SettingWithCopyWarning

In [42]:
trunc_data.shape

(100, 2)

In [43]:
def get_embedding(text, model="text-embedding-3-small"):
   text = text.replace("\n", " ")
   return client.embeddings.create(input = [text], model=model).data[0].embedding

In [44]:
trunc_data['embedding'] = trunc_data.Title.apply(lambda x: get_embedding(x, model='text-embedding-3-small'))

In [45]:
trunc_data

Unnamed: 0,Title,Text,embedding
0,A Beginner’s Guide to Word Embedding with Gens...,1. Introduction of Word2vec\n\nWord2vec is one...,"[-0.024416513741016388, 0.019436374306678772, ..."
1,Hands-on Graph Neural Networks with PyTorch & ...,"In my last article, I introduced the concept o...","[-0.011570210568606853, -0.029422052204608917,..."
2,How to Use ggplot2 in Python,Introduction\n\nThanks to its strict implement...,"[-0.0054690418764948845, -0.024078190326690674..."
3,Databricks: How to Save Data Frames as CSV Fil...,Photo credit to Mika Baumeister from Unsplash\...,"[-0.003941336181014776, -0.022551342844963074,..."
4,A Step-by-Step Implementation of Gradient Desc...,A Step-by-Step Implementation of Gradient Desc...,"[-0.0020211779046803713, -0.000974284252151846..."
...,...,...,...
95,Data Scientist’s toolkit — How to gather data ...,Data Scientist’s toolkit — How to gather data ...,"[-0.026935778558254242, -0.02546488121151924, ..."
96,Deep Learning on a Budget,Introduction\n\nWhy?\n\nThere are many article...,"[-0.016848241910338402, -0.045139651745557785,..."
97,Generating Startup names with Markov Chains,Generating Startup names with Markov Chains\n\...,"[0.013092967681586742, -0.0023525876458734274,..."
98,A Recipe for using Open Source Machine Learnin...,A Recipe for using Open Source Machine Learnin...,"[-0.017321426421403885, -0.021059678867459297,..."


In [46]:
search_string="Deep Learning"

In [47]:
search_embedding=get_embedding(search_string)

In [48]:
trunc_data['relevance'] = trunc_data.embedding.apply(lambda x: float(cosine_similarity([search_embedding],[x])[0][0]))

In [49]:
trunc_data

Unnamed: 0,Title,Text,embedding,relevance
0,A Beginner’s Guide to Word Embedding with Gens...,1. Introduction of Word2vec\n\nWord2vec is one...,"[-0.024416513741016388, 0.019436374306678772, ...",0.300155
1,Hands-on Graph Neural Networks with PyTorch & ...,"In my last article, I introduced the concept o...","[-0.011570210568606853, -0.029422052204608917,...",0.344597
2,How to Use ggplot2 in Python,Introduction\n\nThanks to its strict implement...,"[-0.0054690418764948845, -0.024078190326690674...",0.109014
3,Databricks: How to Save Data Frames as CSV Fil...,Photo credit to Mika Baumeister from Unsplash\...,"[-0.003941336181014776, -0.022551342844963074,...",0.169241
4,A Step-by-Step Implementation of Gradient Desc...,A Step-by-Step Implementation of Gradient Desc...,"[-0.0020211779046803713, -0.000974284252151846...",0.388754
...,...,...,...,...
95,Data Scientist’s toolkit — How to gather data ...,Data Scientist’s toolkit — How to gather data ...,"[-0.026935778558254242, -0.02546488121151924, ...",0.313273
96,Deep Learning on a Budget,Introduction\n\nWhy?\n\nThere are many article...,"[-0.016848241910338402, -0.045139651745557785,...",0.719206
97,Generating Startup names with Markov Chains,Generating Startup names with Markov Chains\n\...,"[0.013092967681586742, -0.0023525876458734274,...",0.165855
98,A Recipe for using Open Source Machine Learnin...,A Recipe for using Open Source Machine Learnin...,"[-0.017321426421403885, -0.021059678867459297,...",0.387291


In [50]:
trunc_data.sort_values(by="relevance",ascending=False).iloc[0:10,:]

Unnamed: 0,Title,Text,embedding,relevance
96,Deep Learning on a Budget,Introduction\n\nWhy?\n\nThere are many article...,"[-0.016848241910338402, -0.045139651745557785,...",0.719206
79,Applied AI: Going From Concept to ML Components,Opening your mind to different ways of applyin...,"[-0.016721589490771294, -0.026498831808567047,...",0.482872
73,Transfer Learning Intuition for Text Classific...,Transfer Learning Intuition for Text Classific...,"[-0.022648293524980545, 0.0017097401432693005,...",0.478084
54,Reinforcement Learning Introduction,Reinforcement Learning Introduction\n\nAn intr...,"[0.009179973974823952, -0.05973218381404877, 0...",0.470448
80,Wild Wide AI: responsible data science,Wild Wide AI: responsible data science\n\nData...,"[0.041022028774023056, -0.00013012583076488227...",0.445177
26,Why Machine Learning Models Degrade In Production,After several failed ML projects due to unexpe...,"[0.012906364165246487, 0.030608268454670906, 0...",0.437601
29,An Introduction to Recurrent Neural Networks f...,An Introduction to Recurrent Neural Networks f...,"[-0.01791239343583584, -0.02631079964339733, 0...",0.424425
9,What if AI model understanding were easy?,Irreverent Demystifiers\n\nWhat if AI model un...,"[-0.011677316389977932, -0.0018296980997547507...",0.42285
68,Getting Started with Google BigQuery’s Machine...,"While still in Beta, BigQuery ML has been avai...","[-0.034346841275691986, 0.012737160548567772, ...",0.416386
69,Review: DeepPose — Cascade of CNN (Human Pose ...,Review: DeepPose — Cascade of CNN (Human Pose ...,"[0.01773509941995144, -0.03974172845482826, 0....",0.412199
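The search above boils down to three steps: embed the query, score every document against it, and keep the highest scores. Here is a minimal sketch of the scoring and ranking steps without pandas. The three-dimensional vectors are made up for illustration and are not real embeddings. `cosine` and `top_k` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, documents, k=3):
    """Return the k (title, score) pairs with the highest similarity to the query."""
    scored = [(title, cosine(query_vec, vec)) for title, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional vectors standing in for real title embeddings
docs = {
    "Deep Learning on a Budget": [0.9, 0.1, 0.0],
    "How to Use ggplot2 in Python": [0.1, 0.9, 0.1],
    "Reinforcement Learning Introduction": [0.7, 0.2, 0.3],
}
query_vec = [1.0, 0.0, 0.0]  # stand-in for the embedding of the search string

results = top_k(query_vec, docs, k=2)
for title, score in results:
    print(f"{score:.3f}  {title}")
```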


### Clustering

In [51]:

matrix = np.vstack(trunc_data.embedding.values)
n_clusters = 4

kmeans = KMeans(n_clusters = n_clusters, init='k-means++', random_state=42)
kmeans.fit(matrix)
trunc_data['Cluster'] = kmeans.labels_

In [52]:
trunc_data

Unnamed: 0,Title,Text,embedding,relevance,Cluster
0,A Beginner’s Guide to Word Embedding with Gens...,1. Introduction of Word2vec\n\nWord2vec is one...,"[-0.024416513741016388, 0.019436374306678772, ...",0.300155,3
1,Hands-on Graph Neural Networks with PyTorch & ...,"In my last article, I introduced the concept o...","[-0.011570210568606853, -0.029422052204608917,...",0.344597,3
2,How to Use ggplot2 in Python,Introduction\n\nThanks to its strict implement...,"[-0.0054690418764948845, -0.024078190326690674...",0.109014,1
3,Databricks: How to Save Data Frames as CSV Fil...,Photo credit to Mika Baumeister from Unsplash\...,"[-0.003941336181014776, -0.022551342844963074,...",0.169241,2
4,A Step-by-Step Implementation of Gradient Desc...,A Step-by-Step Implementation of Gradient Desc...,"[-0.0020211779046803713, -0.000974284252151846...",0.388754,3
...,...,...,...,...,...
95,Data Scientist’s toolkit — How to gather data ...,Data Scientist’s toolkit — How to gather data ...,"[-0.026935778558254242, -0.02546488121151924, ...",0.313273,2
96,Deep Learning on a Budget,Introduction\n\nWhy?\n\nThere are many article...,"[-0.016848241910338402, -0.045139651745557785,...",0.719206,3
97,Generating Startup names with Markov Chains,Generating Startup names with Markov Chains\n\...,"[0.013092967681586742, -0.0023525876458734274,...",0.165855,3
98,A Recipe for using Open Source Machine Learnin...,A Recipe for using Open Source Machine Learnin...,"[-0.017321426421403885, -0.021059678867459297,...",0.387291,0


In [53]:
trunc_data["Prompt"] = trunc_data["Title"] + " belongs to Cluster number " + trunc_data["Cluster"].astype(str)

In [54]:
info=''
for i in trunc_data["Prompt"]:
    info+=i+'\n'


In [55]:
print(info)

A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model belongs to Cluster number 3
Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric belongs to Cluster number 3
How to Use ggplot2 in Python belongs to Cluster number 1
Databricks: How to Save Data Frames as CSV Files on Your Local Computer belongs to Cluster number 2
A Step-by-Step Implementation of Gradient Descent and Backpropagation belongs to Cluster number 3
An Easy Introduction to SQL for Data Scientists belongs to Cluster number 2
Hypothesis testing visualized belongs to Cluster number 2
Introduction to Latent Matrix Factorization Recommender Systems belongs to Cluster number 3
Which 2020 Candidate is the Best at Twitter? belongs to Cluster number 0
What if AI model understanding were easy? belongs to Cluster number 3
What I Learned from (Two-time) Kaggle Grandmaster Abhishek Thakur belongs to Cluster number 2
Making a DotA2 Bot Using ML belongs to Cluster number 3
Building A ‘Serverless’ Chrome Extension b

In [56]:
prompt="""Below is information of blog titles grouped into clusters. There are four clusters. Come up with names for these four clusters based on the titles present in them\n\n"""+info

In [57]:
response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "user", "content": prompt}
  ]
)

In [58]:
print(response.choices[0].message.content)

Cluster Names:
1. Data Visualization and Analysis
2. Data Management and Processing
3. Machine Learning Implementations
4. AI Ethics and Understanding
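Note that the clusters are labelled 0 to 3 while the names above are numbered 1 to 4, so matching names to clusters takes some guesswork. One possible variation, sketched below as an assumption rather than a tested improvement, is to group the titles under their cluster label before building the prompt. `build_cluster_prompt` is a hypothetical helper, and the sample pairs are taken from the output printed earlier.

```python
from collections import defaultdict

# A few (title, cluster) pairs taken from the output printed earlier
labelled_titles = [
    ("Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric", 3),
    ("How to Use ggplot2 in Python", 1),
    ("Databricks: How to Save Data Frames as CSV Files on Your Local Computer", 2),
    ("Which 2020 Candidate is the Best at Twitter?", 0),
    ("Making a DotA2 Bot Using ML", 3),
]

def build_cluster_prompt(pairs):
    """Group titles by cluster label and list them under one heading per cluster."""
    groups = defaultdict(list)
    for title, cluster in pairs:
        groups[cluster].append(title)
    lines = []
    for cluster in sorted(groups):
        lines.append(f"Cluster {cluster}:")
        lines.extend(f"- {title}" for title in groups[cluster])
    return "\n".join(lines)

cluster_prompt = build_cluster_prompt(labelled_titles)
print(cluster_prompt)
```

The grouped text could replace `info` in the prompt, together with a request to answer using the same cluster numbers.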


## 
---

## RAG

### Introduction

<span style="font-size: 20px; color: orange">>><b>Users look at LLMs for knowledge and wisdom, yet LLMs are sophisticated predictors of what word comes next</b></span>

<span style="font-size: 16px; color: blue"><b>Hallucinations and Restricted "Parametric" Memory are the biggest drawbacks of LLMs</b></span>

- Hallucinations - The model provides incorrect information with great confidence.
- Missing Knowledge - The model only knows what was available in its training data.






In [59]:
prompt="What is amica developed by Portable?"

In [60]:
response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "user", "content": prompt}
  ]
)

In [61]:
pprint(response.choices[0].message.content)

('Amica is a comprehensive digital solution developed by Portable, a design '
 'and technology company based in London. It is designed to provide support '
 'and information to individuals seeking asylum in the UK, helping them '
 'navigate the complex immigration system and access the services and '
 'resources they need. Amica aims to make the asylum process more transparent '
 'and accessible for refugees, empowering them to make informed decisions '
 'about their future.')


<span style="font-size: 14px; color: blue"><b>Retrieval Augmented Generation or RAG helps address these problems</b></span>


<img src="../Assets/Images/RAG.png" width=800>

### Retrieval

In [62]:


def extract_page(link): #### Function to extract text from weblink ####
    response=requests.get(link, timeout=30) #### Get response from weblink using requests; the timeout avoids hanging on a slow server ####
    response.raise_for_status() #### Stop early on HTTP errors (4xx/5xx) ####
    soup = BeautifulSoup(response.content, 'html.parser') #### Parse response using BeautifulSoup ####
    text=soup.get_text() #### Extract text from parsed response ####
    lines = filter(lambda x: x.strip(), text.splitlines()) #### Filter out empty lines ####
    website_text = "\n".join(lines) #### Join lines to form text ####
    return website_text

In [63]:
text=extract_page("https://portable.com.au/work/amica")

In [64]:
text

'Designing and developing a digital solution for separating couples | PortableMenu Close What we doPurposeProcessServicesProductsWorkTrainingContactWorkplaceTeamCareersPartnersContentEventsReportsArticlesImpact areasJusticeMental healthDeath and ageingPublic sectorEducationTransportBusiness EnquiriesEmailnewbusiness@portable.com.auMelbourne StudioWurundjeri Woi Wurrung Country5 Easey St, CollingwoodSydney StudioGadigal Country20/40 Meagher St, ChippendaleCanberra StudioNgunnawal Country1/33 Allara Street, CanberraDesigning and developing amica, a digital solution for separating couplesWe have been working with the Legal Services Commission of South Australia for a number of years in exploring user needs and iteratively designing an online dispute resolution tool, amica, to help guide former partners towards an amicable resolution of family law issues. Our team of developers then worked across multiple sprints to roll out amica for use across South Australia, and now across Australia, i

### Augmentation

In [65]:
augmented_prompt=f"You have been provided a context below. Based only on the context, answer the following question \n\n context : {text}\n\n question : {prompt}"
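
The augmentation step is just string assembly, so it can be wrapped in a small reusable helper. The sketch below is illustrative only: `build_augmented_prompt` and its `max_context_chars` cutoff are hypothetical names and values, and truncating by characters is a crude guard rather than a proper token budget.

```python
def build_augmented_prompt(context, question, max_context_chars=8000):
    """Wrap a question with retrieved context, truncating very long context."""
    trimmed = context[:max_context_chars]  # crude character cutoff (assumption)
    return (
        "You have been provided a context below. "
        "Based only on the context, answer the following question.\n\n"
        f"context : {trimmed}\n\n"
        f"question : {question}"
    )

example = build_augmented_prompt(
    "amica is an online dispute resolution tool.", "What is amica?"
)
print(example)
```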

### Generation

In [66]:
response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "user", "content": augmented_prompt}
  ]
)

In [67]:
pprint(response.choices[0].message.content)

('Amica is a digital solution developed by Portable for separating couples to '
 'help guide former partners towards an amicable resolution of family law '
 'issues.')


---

In [68]:
encoding_name="cl100k_base"
encoding = tiktoken.get_encoding(encoding_name)
len(encoding.encode(text))

2083

In [69]:

reader = PyPDF2.PdfReader("../Assets/Data/InnovatorsDilemma.pdf")
pages = reader.pages
# get all pages data
text = "".join([page.extract_text() for page in pages])

In [70]:
len(encoding.encode(text))

117932
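
The two token counts show why pasting the whole source into the prompt stops working: the web page is small, the book is not. A rough check, assuming a context window of 16,385 tokens for `gpt-3.5-turbo` (an assumption; verify the limit for the model you use) and a hypothetical reserve of 1,000 tokens for the answer:

```python
def fits_in_context(n_tokens, context_window=16_385, reserved_for_answer=1_000):
    """Return True if a text of n_tokens leaves room for the model's answer."""
    return n_tokens + reserved_for_answer <= context_window

webpage_tokens = 2083   # token count of the amica page (from the output above)
book_tokens = 117932    # token count of the PDF (from the output above)

print(fits_in_context(webpage_tokens))
print(fits_in_context(book_tokens))
```

When the check fails, the text has to be split into chunks and only the relevant chunks sent to the model.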

### RAG At Scale

In [71]:
def split_text_and_create_embeddings(text): #### Function to create embeddings from text ####
    with open('../Assets/Data/temp.txt','w') as f: #### Open a temporary file; the with block closes it automatically ####
        f.write(text) #### Write text to the temporary file ####
    loader=TextLoader('../Assets/Data/temp.txt') #### Load temporary file using TextLoader ####
    document=loader.load() #### Extract text from temporary file ####
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=2000) #### Initialize text splitter; sizes are measured in characters by default, not tokens ####
    docs = text_splitter.split_documents(document) #### Split document into chunks of up to 10000 characters with up to 2000 characters of overlap ####
    num_emb=len(docs) #### Count number of chunks (one embedding per chunk) ####
    embeddings = OpenAIEmbeddings(openai_api_key=openai.api_key) #### Initialize embeddings ####
    db = FAISS.from_documents(docs, embeddings) #### Create embeddings from the chunks and store them in a FAISS index ####
    return db, num_emb #### Return database with embeddings and number of embeddings ####
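
To see what `chunk_size` and `chunk_overlap` do, here is a simplified stand-in: a plain sliding window over characters. This is not how `RecursiveCharacterTextSplitter` works internally (it also tries to split on separators such as paragraphs and sentences); `chunk_with_overlap` is a hypothetical helper meant only to show the overlap idea.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # last window reached the end
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk repeats the last few characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.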

In [72]:
vector_db, num_emb=split_text_and_create_embeddings(text)

In [73]:
print(num_emb)

68


In [74]:
vector_db.save_local(folder_path="../Assets/Data")

In [75]:
embeddings = OpenAIEmbeddings(openai_api_key=openai.api_key)

In [76]:
local_vectors=FAISS.load_local(folder_path="../Assets/Data/",embeddings=embeddings,allow_dangerous_deserialization=True)

In [77]:
def search_context(db,query): ###### search_context function
     defin=db.similarity_search(query) ###### call the FAISS similarity_search function, which returns the sections most relevant to the user question (4 by default), most relevant first
     return defin[0].page_content ###### return only the single most relevant section to the user question
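
Under the hood, retrieval is a ranking of chunk vectors by how close they are to the query vector. The sketch below uses cosine similarity over toy 3-dimensional vectors; it is not the FAISS index, which may use a different distance metric, and real embeddings have far more dimensions. `cosine` and `top_k` are hypothetical helpers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return (index, score) for the k chunks most similar to the query."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors, for illustration only
toy_chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
toy_query = [1.0, 0.0, 0.0]
ranked = top_k(toy_query, toy_chunks, k=2)
print(ranked)
```

`search_context` keeps only the top result. If a single chunk misses the answer, one option is to pass a larger `k` to `similarity_search` and join several chunks into the context.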

In [78]:
prompt="Why do great firms fail?"

In [79]:
context=search_context(local_vectors,prompt)

In [80]:
context

"2  \n The   \nInnovator’s \nDilemma   \nWhen New Technologies \nCause Great Firms \nto Fail   \n  \nCLAYTON M. CHRISTENSEN  \n  \n  \nHarvard Business School Press   \nBoston, Massachusetts \n \n \n \n \n \n \n \n \n \n \n \n \n \n    3  \nCopyright © 1997 by the President and Fellows of Ha rvard College \nAll rights reserved \n \nThe Library of Congress has catalogued the hardcove r edition of this title as follows:   \n \nChristensen, Clayton M. \n   The innovator’s dilemma : when new technologies cause great firms to fail / Clayton M. Christensen.  \n    p.  cm. — (The management of innovation and cha nge series) \n   Includes index. \n   ISBN 0-87584-585-1 (alk. paper) \n   1. Creative ability in business. 2. Industrial m anagement. 3. Customer services. 4. Success in \nbusiness. I. Title. II. Series. \n   HD53.C49  1997 \n   658—DC20       96-10894  \n              CIP \n \nISBN 0-87584-585-1 (Microsoft Reader edition) \n \n \n \n \n \n \n \n \n \n \n  4  \n \nContents \n  \n    

In [81]:
augmented_prompt=f"You have been provided a context in [[[]]] and a prompt below. Respond to the prompt only from the context. If the information is not present in the context, respond with \"I don't know\"\nContext : [[[{context}]]]\nPrompt : {prompt}"

In [82]:
pprint(client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "user", "content": augmented_prompt}
  ]
).choices[0].message.content)

('Great firms fail due to their inability to stay atop their industries when '
 'they confront certain types of market and technological change. It is not '
 'about the failure of simply any company, but of well-managed companies that '
 'have their competitive antennae up, listen astutely to their customers, '
 'invest aggressively in new technologies, and yet still lose market '
 'dominance. This failure can happen in industries that move fast or slow, '
 'built on different types of technology, and in both manufacturing and '
 'service industries.')


## 
---

## Fine Tuning

### Introduction



<img src="../Assets/Images/SFT.png" width=1000>

Fine-tuning takes a general-purpose model and trains it to perform a specialized task. It can help with:

- Hallucinations
- RAG Misses
- Learn New Information (When data size is large)
- Cost Optimization
- Privacy

**Challenges**

- Need quality data
- Upfront cost
- Expertise

There are three broad steps in LLM finetuning -

1. Data Preparation (for the specific task/use case)
2. Training (chosen training methodology)
3. Evaluation

### Data Preparation

In [None]:
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
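
Each line of the training file is one JSON object with a `messages` list, as in the examples above. Before uploading, it can be worth checking the file locally. The sketch below is a minimal check under assumed rules (valid JSON, a non-empty `messages` list, known roles, at least one assistant message); `validate_chat_jsonl` is a hypothetical helper and does not replicate the validation OpenAI runs on upload.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_chat_jsonl(lines):
    """Return a list of (line_number, problem) for malformed training examples."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing 'messages' list"))
            continue
        roles = [m.get("role") for m in messages if isinstance(m, dict)]
        if len(roles) != len(messages) or not set(roles) <= ALLOWED_ROLES:
            problems.append((number, "unexpected role or message shape"))
        elif "assistant" not in roles:
            problems.append((number, "no assistant message to learn from"))
    return problems

sample_lines = [
    '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}',
    '{"messages": [{"role": "user", "content": "Hi"}]}',
    'not json',
]
issues = validate_chat_jsonl(sample_lines)
print(issues)
```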

In [None]:
data_folder_path='../Assets/Data/'
training_data_file='fine_tuning_data.csv'

In [None]:
data_for_finetuning=pd.read_csv(data_folder_path+training_data_file)

In [None]:
data_for_finetuning.head()

In [None]:
data_for_finetuning.shape

In [None]:
train_data, test_data = train_test_split(data_for_finetuning, test_size=0.2, random_state=42)

In [None]:
train_data.shape

In [None]:
test_data.shape

In [None]:
prompt_end="<--"
completion_end="-->"

In [None]:
def convert_data_for_turbo(data,file):
  with open(file,'w') as f:
    for _,rows in data.iterrows():
      prompt=rows['prompt']+prompt_end
      completion=rows['completion']+completion_end
      json_line = {'messages': [{'role': 'system',
                                'content': ''},
                                {'role': 'user',
                                 'content': prompt},
                                {'role': 'assistant',
                                'content': completion}]}
      f.write(json.dumps(json_line) + '\n')

  print(f'JSONlines file "{file}" has been created.')

In [None]:
train_file_turbo_name='train_turbo.jsonl'
test_file_turbo_name='test_turbo.jsonl'

In [None]:
convert_data_for_turbo(train_data,data_folder_path+train_file_turbo_name)
convert_data_for_turbo(test_data,data_folder_path+test_file_turbo_name)

In [None]:
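with open(data_folder_path+train_file_turbo_name, "rb") as f: # the with block closes the file after upload
    client.files.create(
                    file=f,
                    purpose='fine-tune'
    )

In [None]:
with open(data_folder_path+test_file_turbo_name, "rb") as f: # the with block closes the file after upload
    client.files.create(
                    file=f,
                    purpose='fine-tune'
    )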

### Training

In [None]:
client.fine_tuning.jobs.create(
              training_file="file-6AUxqwaqG16AFy1PVLEGFKZI",
              validation_file="file-7GByyQB3nnXSKZtgLQt542UU",
              model="gpt-3.5-turbo-0125",
              suffix="Workshop",
              hyperparameters={"n_epochs":1,
                               "batch_size":1,
                               }
              )

In [None]:
print(client.fine_tuning.jobs.list(limit=2).model_dump_json(indent=5))

In [None]:
print(client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-OP2LgxuFTe4UxHoxYbYzG7Vw", limit=2).model_dump_json(indent=5))
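
A fine-tuning job runs asynchronously, so the notebook has to check back until it finishes. A minimal polling sketch follows; `wait_for_job`, the poll interval, and the poll limit are hypothetical choices. The status source is passed in as a function, so the example runs here with a fake sequence of statuses instead of a real API call.

```python
import time

# Statuses after which the job will not change any further
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(get_status, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call get_status() until it returns a terminal status or polls run out."""
    status = None
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    return status

# Dry run with a fake status source instead of a real API call
fake_statuses = iter(["validating_files", "running", "succeeded"])
final_status = wait_for_job(lambda: next(fake_statuses), sleep=lambda s: None)
print(final_status)
```

With the real client, `get_status` could be something like `lambda: client.fine_tuning.jobs.retrieve(job_id).status`, where `job_id` is the id returned when the job was created.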

### Evaluation

In [None]:
test_data.iloc[10]["prompt"]

In [None]:
prompt=str(test_data.iloc[10]["prompt"])+prompt_end

In [None]:
response = client.chat.completions.create(
  model="ft:gpt-3.5-turbo-0125:artificial-kimtelligence:workshop:94dD2dNr",
  messages=[
            {'role': 'system','content': ''},
            {"role": "user", "content": prompt}
  ]
)

In [None]:
print(response.choices[0].message.content.removesuffix(completion_end)) # drop the trailing "-->" end marker only if the model produced it

In [None]:
response = client.chat.completions.create(
  model="ft:gpt-3.5-turbo-0125:artificial-kimtelligence:workshop:94dD2dNr",
  messages=[
    {'role': 'system','content': ''},
    {"role": "user", "content": "explain How Homomorphic Encryption Works"}
  ]
)

print(response.choices[0].message.content)
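
Looking at one example at a time only goes so far. A rough way to score the fine-tuned model over the whole test set is to compare its outputs with the reference completions. The sketch below uses exact match after removing the `-->` end marker, which only suits short, label-like completions; longer free-text answers need a different metric. The helper names and the sample outputs are hypothetical.

```python
def strip_end_marker(text, marker="-->"):
    """Remove the completion end marker if the model produced it."""
    text = text.strip()
    return text[: -len(marker)].strip() if text.endswith(marker) else text

def exact_match_rate(predictions, references, marker="-->"):
    """Share of predictions equal to their reference after cleaning."""
    if not references:
        return 0.0
    hits = sum(
        strip_end_marker(p, marker).lower() == strip_end_marker(r, marker).lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

# Hypothetical outputs, for illustration only
preds = ["positive-->", "negative", "neutral-->"]
refs = ["positive", "negative-->", "positive"]
score = exact_match_rate(preds, refs)
print(score)
```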

##

---

<img src="../Assets/Images/profile.png" width=50> [Hi! I'm Abhinav!](https://www.linkedin.com/in/abhinav-kimothi/)

<span style="font-size: 20px; color: orange">>><b>Hope to stay connected!</b></span>


 
[![GitHub followers](https://img.shields.io/github/followers/abhinav-kimothi?label=Follow&style=social)](https://github.com/abhinav-kimothi)
[![Me](https://img.shields.io/badge/Medium-8A2BE2)](https://medium.com/@abhinavkimothi)
[![LIn](https://img.shields.io/badge/LinkedIn-blue)](https://www.linkedin.com/in/abhinav-kimothi/)
[![Mail](https://img.shields.io/badge/eMail-green)](mailto:abhinav.kimothi.ds@gmail.com)
[![Twitter Follow](https://img.shields.io/twitter/follow/@?style=social)](https://twitter.com/abhinav_kimothi)


<span style="font-size: 20px; color: orange">>><b>Also, read these for more details on Generative AI!</b></span>



<a href="https://abhinavkimothi.gumroad.com/l/GenAILLM">
    <img src="https://public-files.gumroad.com/jsdnnne2gnhu61f6hrdprwx2255i" width=150>
</a><a href="https://abhinavkimothi.gumroad.com/l/RAG">
    <img src="https://public-files.gumroad.com/v17k9tp2fnbbtg8iwoxt4m3xgivq" width=150>
</a><a href="https://abhinavkimothi.gumroad.com/l/GenAITaxonomy">
    <img src="https://public-files.gumroad.com/a730ysxb7a928bb5xkz6fuqabaqp" width=150>
</a>



---

<img src="../Assets/Images/KCBAI.png" height =300>