#### Difference between Prompt Engineering, Model Fine-Tuning, and Embeddings-Based Search

- Prompt Engineering: Crafting effective prompts to get the desired output from a language model. It is the practice of wording instructions and context so that an AI chatbot makes the best use of its capabilities.


- Model Fine-Tuning: In NLP, fine-tuning means continuing to train a pre-trained model on custom data to adapt it to a domain. The training updates the model's weights, augmenting what the model already knows.


- Embeddings-based search: Retrieving the passages most relevant to a question by comparing embedding vectors, then passing those passages to the model inside the prompt. The model's weights and prior knowledge are unchanged. Instead, the prompt draws the boundary: it instructs the model to answer only from the data provided and to reply with something like "Sorry, I cannot answer that." to questions that the data does not cover.

# Embeddings-Based Question Answering

In [2]:
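import openai
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
# Also set the key on the module, in case the SDK read the environment variable before it was set
openai.api_key = os.environ['OPENAI_API_KEY']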

OpenAI API Key:········


In [4]:
# Importing the dataset

import pandas as pd

df = pd.read_excel('BTS_chapter2.xlsx')

In [5]:
df.head()

Unnamed: 0,text
0,Subject: Proof Album by BTS. Proof is the firs...
1,"Subject: ""Jack in the Box"" album by J-hope of ..."
2,Subject: Indigo Album by RM of BTS. Indigo is ...
3,Subject: Face Album by Jimin of BTS. Face is t...
4,D-Day is the debut studio album by South Korea...


In [26]:
#df['Number of tracks'] = df['Number of tracks'].astype(str)

In [27]:
#df['text'] = "Album Name: " + df['Album Name'] + "; Artist: " + df['Artist'] + "; Release Date: " + df["Release Date"] + "; Language: " + df['Language'] + "; Number of tracks: " + df['Number of tracks'] + "; Length: " + df["Length"] + "; Overview: " + df['Overview'] + "; List of song names: " + df['List of song names']

In [8]:
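# Note: openai.embeddings_utils ships only with the legacy openai SDK (versions below 1.0)
from openai.embeddings_utils import get_embedding
# One API call per row; each text is embedded with text-embedding-ada-002
df['embedding'] = df.text.apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
df.head()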

Unnamed: 0,text,embedding
0,Subject: Proof Album by BTS. Proof is the firs...,"[-0.028880411759018898, -0.03018958680331707, ..."
1,"Subject: ""Jack in the Box"" album by J-hope of ...","[-0.024111641570925713, -0.04682282730937004, ..."
2,Subject: Indigo Album by RM of BTS. Indigo is ...,"[-0.008107353933155537, -0.024580668658018112,..."
3,Subject: Face Album by Jimin of BTS. Face is t...,"[-0.04436878114938736, -0.012539854273200035, ..."
4,D-Day is the debut studio album by South Korea...,"[-0.029493065550923347, -0.013064444065093994,..."


In [9]:
#df.drop(['Album Name', 'Artist', 'Language', 'Release Date', 'Number of tracks', 'Length', 'Overview', 'List of song names'], axis=1, inplace=True)

#df.head()

In [10]:
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

In [11]:
import ast  # for converting embeddings saved as strings back to arrays
import openai  # for calling the OpenAI API
import pandas as pd  # for storing text and embeddings data
import tiktoken  # for counting tokens
from scipy import spatial  # for calculating vector similarities for search
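
The `ast` import above is only needed when embeddings are saved to disk and reloaded, which this notebook does not do. As a minimal sketch of that round trip, assuming made-up three-number embeddings and an in-memory CSV in place of a real file (both are illustrative, not part of the original notebook):

```python
import ast
import csv
import io

# Hypothetical rows: a text and a short, made-up embedding.
rows = [
    {"text": "Subject: Proof Album by BTS.", "embedding": [0.1, -0.2, 0.3]},
    {"text": "Subject: Indigo Album by RM of BTS.", "embedding": [0.0, 0.5, -0.4]},
]

# Writing to CSV turns each list into a string such as "[0.1, -0.2, 0.3]".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embedding"])
writer.writeheader()
writer.writerows(rows)

# Reading back yields strings; ast.literal_eval parses them into lists again.
buffer.seek(0)
loaded = [
    {"text": r["text"], "embedding": ast.literal_eval(r["embedding"])}
    for r in csv.DictReader(buffer)
]
```

With pandas, the equivalent step after `pd.read_csv` would be `df['embedding'] = df['embedding'].apply(ast.literal_eval)`. Saving the embeddings this way avoids paying for the embedding API calls again on every run.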


In [12]:
# search function


def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    # zip() yields tuples; convert to lists to match the annotated return type
    return list(strings[:top_n]), list(relatednesses[:top_n])
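
The default `relatedness_fn` is cosine similarity, written as one minus SciPy's cosine distance. The sketch below spells out the same quantity in plain Python and ranks a few texts by it. The three-dimensional vectors and the helper name `cosine_relatedness` are made up for illustration; real `text-embedding-ada-002` vectors have 1536 dimensions.

```python
import math


def cosine_relatedness(x: list[float], y: list[float]) -> float:
    """Cosine similarity: dot(x, y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Made-up vectors standing in for real embeddings.
toy_embeddings = {
    "Jack in the Box album": [0.9, 0.1, 0.0],
    "Indigo album": [0.1, 0.9, 0.0],
    "Face album": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # points mostly in the same direction as the first vector

# Score every text against the query, then sort from most to least related.
ranked = sorted(
    ((text, cosine_relatedness(toy_query, vec)) for text, vec in toy_embeddings.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
top_text, top_score = ranked[0]
```

Here `top_text` is the entry whose vector is closest in direction to the query. Cosine similarity ignores vector length, so only the direction of the embeddings matters for the ranking.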
    

In [13]:
# examples
strings, relatednesses = strings_ranked_by_relatedness("Jack in the Box", df, top_n=1)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    display(string)

relatedness=0.845


'Subject: "Jack in the Box" album by J-hope of BTS. Jack in the Box\xa0is the debut\xa0studio album\xa0of South Korean rapper\xa0J-Hope, released on July 15, 2022, through\xa0Big Hit Music. It contains 10 tracks, including the lead single, "More", which preceded it on July 1, and the follow-up single, "Arson", which was released the same day as the album, together with an accompanying music video. A concept album revolving around the story of\xa0Pandora\'s box,\xa0Jack in the Box\xa0discusses themes of passion, ambition, humanity, insecurity, success, and anxiety about the future. Primarily an\xa0old-school hip hop\xa0record, the album features a blend of genres, including\xa0pop,\xa0grunge, and\xa0R&B. Jack in the Box represents J-Hope\'s "own musical personality and vision as an artist" and his "aspirations to break the mold and grow further".In an interview for Weverse Magazine published in June 2022, the rapper divulged his desire to showcase a "different side of me...an extremely 

In [14]:
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = 'Use the below descriptions of BTS chapter 2 projects to answer the subsequent question. If the answer cannot be found in the description, write "I could not find an answer. I specialise in Chapter 2 of BTS and not on the topic you asked about."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_album = f'\n\nBTS Chapter 2 project description:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_album + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_album
    return message + question


def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about BTS Chapter 2 projects."},
        {"role": "user", "content": message},
    ]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response["choices"][0]["message"]["content"]
    return response_message
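
The budget loop in `query_message` can be looked at in isolation. The following is a rough sketch of the same greedy packing, with a whitespace word count standing in for `tiktoken` so that it runs without any external library. The names `count_tokens` and `pack_within_budget` and the sample strings are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words, not real tokens."""
    return len(text.split())


def pack_within_budget(intro: str, sources: list[str], question: str, budget: int) -> str:
    """Append sources in ranked order until the next one would exceed the budget."""
    message = intro
    for source in sources:
        candidate = f"\n\nSource:\n{source}"
        if count_tokens(message + candidate + question) > budget:
            break  # stop at the first source that does not fit, as query_message does
        message += candidate
    return message + question


intro = "Use the sources below to answer the question."
sources = ["alpha " * 5, "beta " * 5, "gamma " * 5]  # already sorted by relatedness
question = "\n\nQuestion: What is alpha?"
prompt = pack_within_budget(intro, sources, question, budget=25)
```

With a budget of 25 words, the first two sources fit and the third is dropped. Because the sources arrive sorted by relatedness, the loop keeps the most relevant texts and cuts off the tail. The `4096 - 500` default in `ask` follows the same logic, apparently leaving about 500 tokens of the context window for the model's answer.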

#### Now that the GPT model receives the relevant custom data in its prompt (no training is involved), let us test it by asking it questions based on the data.

In [15]:
ask("When did J-hope release his latest album?")

'J-hope released his latest album, "Jack in the Box", on July 15, 2022.'

In [16]:
ask("What is the theme of the album D-Day by Agust D?")

'The theme of the album D-Day by Agust D is liberation and freedom, as well as reflecting on the meaning of both and encouraging listeners to focus on themselves instead of dwelling on the past or fearing the future. The album also features thought-provoking social commentary and personal reflections.'

In [17]:
ask("Tell me about the song Hageum from D-Day album.")

'"Haegeum" is the second single from Agust D\'s D-Day album. It is a heavy hip hop song that features a South Korean traditional two-stringed fiddle called haegeum in its instrumentation. The track is about advocating for freedom in a reality built on unspoken societal expectations and restrictions echoed in today\'s online culture. Agust D asks listeners to question their own liberation and the role they play in that of others, making a case for "doing away \'with the nonsense\' that clutters online" and reality.'

In [59]:
ask("What inspired the lyrics of the Face album?")

'The Face album by Jimin of BTS was heavily inspired by the emotional impact of the COVID-19 pandemic on Jimin as a person and performer and addresses "themes of loneliness, wrestling with oneself, and finding freedom".'

In [17]:
ask("Tell me about the theme of song The Astronaut. Why is it named that way? ")

"The theme of the song The Astronaut by Jin of BTS is about his affection for and relationship with his fans. It explores themes of connection and love through the use of a recurring cosmic motif favored by both the band and singer, as evidenced in other songs they have written. The song's title, The Astronaut, refers to the accompanying music video that portrays the singer as an alien astronaut who accidentally crash-lands on Earth and must eventually choose between staying with his found family or returning to his home planet."

In [20]:
ask("Summarize everything about the album Indigo.")

'Indigo is the debut studio album by RM of BTS, released on December 2, 2022, through Big Hit Music. The album comprises 10 tracks, including a collaboration with Youjeen of Cherry Filter, and features appearances by Erykah Badu, Anderson .Paak, Tablo of Epik High, Kim Sa-wol, Paul Blanco, Mahalia, Colde, and Park Ji-yoon. The album peaked at number two in South Korea, number three in Lithuania, Portugal, and the United States, and number four in Japan. It was certified double platinum by the Korea Music Content Association and has sold over 700,000 copies domestically. The album recounts "stories and experiences [RM] has gone through, like a diary" and serves as a documentation or archive of his late twenties. The album\'s lead single, "Wild Flower", was released alongside the album on December 2, 2022, and its music video premiered on YouTube. The album\'s theme is the colors of nature, human, etc., and it is a documentation of RM\'s youth in the moment of independent phase. The pain

In [23]:
ask("Why did J-hope release the song On the Street?")

'J-hope released the song "On the Street" as a "meaningful gift" to his fans before he begins his mandatory military service. The song is an ode to J-Hope\'s artistic roots and the intersection of his love for street dance and hip hop.'

In [24]:
ask("Who is J Cole?")

'J Cole is an American rapper who collaborated with J-Hope of BTS on the song "On the Street".'

In [26]:
ask("Is Run BTS a song?")

'Yes, Run BTS is a song included in the BTS Chapter 2 project Proof Album by BTS. It is featured in disc two of the 3-disc project and is described as a hip-hop/rock track.'

In [29]:
ask("what inspired the Indigo album?")

'The Indigo album by RM of BTS was inspired by "stories and experiences [he] has gone through, like a diary." and serves as a documentation or archive of his late twenties. It also features the painting Blue, by the late Korean artist Yun Hyong-keun, whom RM is known to be an admirer of. RM has said that with Indigo he created a collaboration that "transcends boundaries" between music and art.'

In [31]:
ask("What is the content of the music video of On the Street?")

'The music video of On the Street shows J-Hope walking and dancing through the streets of New York City then in the Bowery subway station while Cole is seen standing and dancing on the rooftop of a city building. J-Hope then enters the Bowery subway station as Cole begins rapping his verses, and dances freestyle on an empty platform. He exits the subway as the song winds down and the visual fades to black. A bonus end-scene shows J-Hope meeting Cole atop the rooftop and the two "exchanging pleasantries", reminiscent of the scene from the teaser, before again fading to black.'

In [35]:
ask("Who are the people in the music video of The Astronaut?")

'The music video of The Astronaut portrays Jin as an alien astronaut who accidentally crash-lands on Earth and must eventually choose between staying with his found family or returning to his home planet. Chris Martin makes a brief cameo in one scene, as a television newscaster who announces the sighting of the light beam. Therefore, the people in the music video are Jin and Chris Martin.'

In [36]:
ask("What about the child in the music video of the Astronaut?")

'In the music video of the Astronaut, Jin befriends a young girl while on Earth and places his helmet on her head as a farewell before leaving to return to his home planet.'

In [40]:
ask("Has IU ever collaborated  with Suga?")

'Yes, IU has collaborated with Suga on the track "People Pt. 2" from Suga\'s album D-Day.'

In [43]:
ask("Is there a music video for Set Me Free Pt.2 of the Face Album?")

'Yes, there is a music video for Set Me Free Pt.2 of the Face Album.'

In [20]:
ask("How to book flight tickets to South Korea?")

'I could not find an answer. I specialize in Chapter 2 of BTS and not on the topic you asked about.'

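With this setup, the language model answers only questions that the retrieved data covers and declines questions on other topics. The model itself was not retrained: the restriction comes from the instruction in the prompt built by `query_message`, so it holds only as far as the model follows that instruction.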