# **Question 1**

**Question 1: Explain the following:**

*    Define the context window in the context of language models.
*    Explain why the model's context window is typically kept small.
*    Identify and elaborate on which OpenAI language model boasts the largest context window
and provide a detailed explanation.

1. A context window in large language models (LLMs) is the amount of text, measured in tokens, that the model can process at one time when generating responses or understanding language. The window covers both the input text and the model's previous outputs, so its size determines how much earlier material the model can draw on to produce coherent, contextually relevant responses.

2. The context window size is typically kept small due to several factors:
* Computational Complexity: In standard transformer self-attention, every token attends to every other token, so compute and memory grow quadratically with the number of tokens in the context. The number of model parameters does not change, but a longer window still raises training and inference costs and resource demands sharply.

* Performance Trade-offs: A smaller context window can help maintain the model's responsiveness and efficiency, as processing larger amounts of text can slow down response times and increase latency. If the window is too large, the model may also process irrelevant information, diluting its focus on the most pertinent context.

* Memory Limitations: Models run on hardware with finite memory and processing capacity. A smaller context window lets them operate within these limits while still delivering adequate performance for many applications.

3. Within OpenAI's GPT-4 family, the `gpt-4-32k` variant offers a context window of 32,768 tokens, and the later GPT-4 Turbo and GPT-4o models extend this to 128,000 tokens. A larger context window lets the model handle significantly longer texts, improving its ability to maintain context over extended conversations or complex documents. Because these limits change between releases, check OpenAI's current model documentation for the latest figures.
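
The two ideas above, a fixed token budget and a cost that grows quadratically with length, can be sketched in a few lines. This is a minimal illustration rather than how any particular model is implemented: `attention_pairs` and `fit_to_window` are hypothetical helper names, and whitespace splitting stands in for a real tokenizer.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key scores one full self-attention layer computes."""
    return num_tokens * num_tokens


def fit_to_window(tokens: list[str], window: int, reserve_for_output: int) -> list[str]:
    """Keep only the most recent tokens that fit, leaving room for the reply."""
    budget = window - reserve_for_output
    if budget <= 0:
        raise ValueError("reserve_for_output must be smaller than window")
    return tokens[-budget:]


# Assumption: whitespace splitting is a stand-in for a real tokenizer.
history = "the quick brown fox jumps over the lazy dog".split()
print(fit_to_window(history, window=8, reserve_for_output=3))

# Doubling the context length quadruples the number of attention scores.
for n in (1_000, 2_000, 4_000):
    print(n, attention_pairs(n))
```

Dropping the oldest tokens is only one possible policy; summarizing earlier turns is another common choice.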

# **Question 2**

**Question 2: Please discuss your understanding on the following topics:**
* What are the main challenges and limitations associated with training and deploying LLMs, and how do
researchers and developers address these challenges to ensure model reliability and
performance?
* Explain your understanding on Embeddings and Model Fine Tuning

**1. Main challenges in training and deploying LLMs:**
* Data quality and quantity issues
* Computational resource limitations
* Fine-tuning overhead
* Inference latency
* Hallucinations
* Lack of interpretability
* Ethical considerations

**Strategies to address challenges:**
* Improving data management
* Investing in infrastructure
* Developing efficient algorithms
* Enhancing interpretability
* Implementing ethical guidelines

**2. Understanding embeddings and fine-tuning**

Embeddings: Dense vector representations of words or phrases that capture semantic meaning, so that texts with similar meanings map to nearby vectors.

Fine-tuning: Adapting a pre-trained LLM to a specific task by training on a smaller dataset to leverage general language understanding while specializing for particular applications.
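
To make the embedding idea concrete, here is a minimal sketch that finds the nearest neighbour of a word by cosine similarity. The three-dimensional vectors in `TOY_EMBEDDINGS` are invented for illustration; real embeddings come from a trained model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, not the output of any real model.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.85, 0.2, 0.05],
    "car": [0.1, 0.9, 0.2],
}


def most_similar(word):
    """Return the other word whose vector is closest to `word`."""
    query = TOY_EMBEDDINGS[word]
    others = (w for w in TOY_EMBEDDINGS if w != word)
    return max(others, key=lambda w: cosine_similarity(query, TOY_EMBEDDINGS[w]))


print(most_similar("cat"))  # kitten
```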

# **Question 3**

**Question 3: Design a sentiment analysis application using OpenAI's models to determine the
sentiment of a given text (positive, negative, or neutral)?**

In [5]:
!pip install openai
import openai



In [9]:
import os
openai.api_key = os.getenv("OPENAI_API_KEY")  # read the key from the environment instead of hard-coding it

In [16]:
def sentiment_analysis(text):
    """Classify the sentiment of `text` as positive, negative, or neutral."""
    messages = [
        {"role": "system", "content": (
            "You are trained to analyze and detect the sentiment of given text. "
            "If you're unsure of an answer, you can say \"not sure\" and recommend users to review manually."
        )},
        {"role": "user", "content": (
            "Analyze the following text and determine if the sentiment is: positive, negative, or neutral. "
            f"Return the answer as a single word, either positive, negative, or neutral: {text}"
        )},
    ]

    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        max_tokens=5,   # room for one label or "not sure"; a 1-token limit can cut the reply short
        n=1,
        temperature=0,  # deterministic output for classification
    )

    return response.choices[0].message.content.strip().lower()

In [None]:
my_string = "I like to eat chips"
response = sentiment_analysis(my_string)
print(my_string, ': The Sentiment is', response)
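
A chat model may not always reply with exactly one of the expected words, so it can help to normalize the reply before using it. The sketch below is one possible approach; `normalize_label` and `ALLOWED_LABELS` are hypothetical names, and the fallback value mirrors the "not sure" wording in the system prompt.

```python
ALLOWED_LABELS = ("positive", "negative", "neutral")


def normalize_label(raw_reply: str) -> str:
    """Map a free-form model reply onto one allowed label, or 'not sure'."""
    # Lowercase and strip surrounding whitespace, quotes, and end punctuation.
    cleaned = raw_reply.strip().lower().strip(".!\"'")
    for label in ALLOWED_LABELS:
        if cleaned.startswith(label):
            return label
    return "not sure"


print(normalize_label(" Positive."))      # positive
print(normalize_label("It seems great"))  # not sure
```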

# **Question 4**

**Question 4: Design a Python program that compares two text inputs and determines their
similarity using the OpenAI API.**

Instructions:

* Prompt the user to input two texts for comparison.
* Utilize the OpenAI GPT-3.5-turbo language model to analyze the similarity between the
provided texts.
* Display the similarity score to the user, indicating how similar the two texts are.
* Handle errors gracefully, such as empty inputs or API request failures.
* Ensure the program is properly documented and easy to understand.
* Optionally, provide additional information to the user about the comparison process or
the meaning of the similarity score.

In [None]:
import os
import openai
import numpy as np

# Set your OpenAI API key here
openai.api_key = os.getenv('OPENAI_API_KEY')  # Ensure your API key is set in the environment

def get_text_embedding(text):
    """Fetch the embedding for a given text using OpenAI's Embeddings API.

    Returns an (embedding, error) pair; exactly one of the two is None.
    """
    try:
        # openai>=1.0 interface, matching the chat call used in Question 3
        response = openai.embeddings.create(
            input=text,
            model="text-embedding-ada-002"  # Use the Ada model for embeddings
        )
        return response.data[0].embedding, None
    except Exception as e:
        return None, f"Error fetching embedding: {e}"

def calculate_cosine_similarity(embedding_a, embedding_b):
    """Calculate the cosine similarity between two embeddings."""
    return np.dot(embedding_a, embedding_b) / (np.linalg.norm(embedding_a) * np.linalg.norm(embedding_b))

def main():
    print("Welcome to the Text Similarity Analyzer!")

    # Prompt user for input
    text1 = input("Please enter the first text: ").strip()
    text2 = input("Please enter the second text: ").strip()

    # Validate inputs
    if not text1 or not text2:
        print("Error: Both texts must be provided.")
        return

    # Get embeddings for both texts
    embedding1, error1 = get_text_embedding(text1)
    embedding2, error2 = get_text_embedding(text2)

    if error1:
        print(error1)
        return
    if error2:
        print(error2)
        return

    # Calculate similarity
    similarity_score = calculate_cosine_similarity(embedding1, embedding2)

    # Display the result
    print(f"\nSimilarity Score: {similarity_score:.4f}")
    print("Note: Cosine similarity ranges from -1 to 1; the closer the score is to 1, the more similar the texts are in meaning.")

if __name__ == "__main__":
    main()
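
The raw score is easier for users to read with a short verbal interpretation next to it. The sketch below shows one way to do that. The function name `describe_similarity` and the cut-offs 0.90 and 0.80 are assumptions chosen for illustration; scores from embedding models often fall in a narrow band, so thresholds should be calibrated on your own data.

```python
def describe_similarity(score: float) -> str:
    """Turn a cosine similarity score into a rough verbal description."""
    # Allow a tiny tolerance for floating-point rounding at the boundaries.
    if not -1.0 - 1e-6 <= score <= 1.0 + 1e-6:
        raise ValueError("cosine similarity must lie in [-1, 1]")
    if score >= 0.90:   # hypothetical cut-off
        return "very similar"
    if score >= 0.80:   # hypothetical cut-off
        return "somewhat similar"
    return "not very similar"


for value in (0.95, 0.84, 0.42):
    print(f"{value:.2f}: {describe_similarity(value)}")
```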

# **Question 5**

**Question 5: Develop a Python program that generates a blog post based on a user-provided topic.**

The program should use OpenAI's API to create a complete blog post including a title, a description
(approximately 300 words), related keywords, SEO meta title, SEO meta description, and a
corresponding image. The program should interact with the user through the input() function for topic
input and display the generated content within the Colab notebook.

Hints:
* Review OpenAI's API documentation, and use the appropriate models for the task.
* Practice prompt engineering to guide the AI for each specific content piece (title, description,
keywords, meta title, and description).

* For image generation, describe the kind of image that would be suitable for the blog post and
visualize it accordingly.
* Implement robust error handling for API interactions.

In [None]:
import os
import openai

# Set your OpenAI API key here
openai.api_key = os.getenv('OPENAI_API_KEY')  # Ensure your API key is set in the environment

def _ask(prompt, max_tokens):
    """Send one user prompt to gpt-3.5-turbo and return the reply text."""
    # openai>=1.0 interface, matching the chat call used in Question 3
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens
    )
    return response.choices[0].message.content.strip()

def generate_blog_post(topic):
    """Generates a blog post based on the provided topic."""
    try:
        # Token budgets are sized so each piece can finish without being cut off
        return {
            "title": _ask(f"Generate a catchy title for a blog post about '{topic}'.", 30),
            "description": _ask(f"Write a detailed blog post description about '{topic}' in approximately 300 words.", 500),
            "keywords": _ask(f"Generate a list of related keywords for the blog post about '{topic}'.", 60),
            "meta_title": _ask(f"Create an SEO-friendly meta title for a blog post about '{topic}'.", 30),
            "meta_description": _ask(f"Write an SEO meta description for a blog post about '{topic}'.", 80),
            "image_description": _ask(f"Describe a suitable image for a blog post about '{topic}'.", 80)
        }

    except Exception as e:
        return f"An error occurred while generating the blog post: {e}"

def main():
    print("Welcome to the Blog Post Generator!")

    # Prompt user for input
    topic = input("Please enter the topic for your blog post: ").strip()

    # Validate input
    if not topic:
        print("Error: You must provide a topic.")
        return

    # Generate blog post content
    blog_post = generate_blog_post(topic)

    if isinstance(blog_post, str):  # Check if an error message was returned
        print(blog_post)
    else:
        # Display the generated content
        print("\nGenerated Blog Post Content:")
        print(f"Title: {blog_post['title']}")
        print(f"Description: {blog_post['description']}")
        print(f"Related Keywords: {blog_post['keywords']}")
        print(f"SEO Meta Title: {blog_post['meta_title']}")
        print(f"SEO Meta Description: {blog_post['meta_description']}")
        print(f"Image Description: {blog_post['image_description']}")

if __name__ == "__main__":
    main()
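
The hint about robust error handling can go one step further than a single `try`/`except`: transient API failures are often worth retrying. The sketch below shows retry with exponential backoff in plain Python. `call_with_retries` and the `flaky` stub are hypothetical and not part of the OpenAI library; the stub stands in for any API call.

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a stub that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


# Pass a no-op sleep so the demo finishes instantly.
print(call_with_retries(flaky, sleep=lambda seconds: None))  # ok
```

In practice you would retry only on errors that are likely to be temporary, such as rate limits or timeouts, rather than on every exception.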

# **Question 6**

**Question 6: What is Hugging Face and why do we use it, briefly explore it?**
* Explore Transformers
* Explore pipeline
* Explore Hugging Face API

**Hugging Face is a prominent open-source platform and community that specializes in machine learning (ML) and natural language processing (NLP). It provides tools and resources for developers and researchers to build, deploy, and share ML models efficiently.**


* What is Hugging Face?

 Hugging Face serves as a collaborative hub for AI enthusiasts, offering a wide range of pre-trained models, datasets, and tools that simplify the development of ML applications. It is often referred to as the "GitHub for AI" due to its community-driven approach and the vast repository of models available for public use. Founded in 2016, Hugging Face initially started as a chatbot application but quickly pivoted to focus on providing accessible ML tools, particularly through its Transformers library, which has become a staple in the NLP field.

* Transformers

 The Transformers library is one of Hugging Face's flagship offerings. It provides a user-friendly interface to a wide variety of pre-trained models for tasks such as text classification, translation, and summarization. The library supports multiple frameworks, including PyTorch and TensorFlow, so developers can integrate these models into their projects easily. With hundreds of thousands of models hosted on the Hugging Face Hub, users can quickly find and apply state-of-the-art models without starting from scratch.

* Pipeline

 The pipeline feature in Hugging Face simplifies the process of using models for common tasks. By abstracting the model loading and inference steps, it allows users to perform tasks like sentiment analysis, named entity recognition, and text generation with just a few lines of code. This feature is particularly beneficial for those who may not have extensive machine learning experience, as it provides a straightforward way to leverage powerful models without deep technical knowledge.


* Hugging Face API

 The Hugging Face API enables developers to interact with models hosted on the Hugging Face Hub programmatically. This API allows users to send requests to models and receive predictions in real-time, making it suitable for integrating ML capabilities into applications. The API supports various tasks, including text generation, image processing, and audio analysis, providing a versatile tool for developers looking to enhance their applications with AI features.
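
To give a feel for what the pipeline abstraction described above hides, the sketch below reproduces only the final post-processing step of a text classifier in plain Python: turning raw model scores (logits) into a label and a probability. It is a simplified illustration, not the library's actual code; the logits and the `id2label` mapping are made up.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Pick the most probable class and report its label and score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits for a two-class sentiment model.
print(postprocess([-2.0, 3.0], {0: "NEGATIVE", 1: "POSITIVE"}))
```

A real pipeline also handles tokenization, batching, and running the model before this step.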


# **Question 7**

**Question 7: Build a simple text sentiment analyzer that categorizes sentences as positive, negative, or neutral.**

Hint: Pick any pre-trained sentiment analysis model from the Hugging Face Hub.

In [2]:
from transformers import pipeline

def main():
    # Load the sentiment analysis pipeline
    sentiment_pipeline = pipeline("sentiment-analysis")

    print("Welcome to the Sentiment Analyzer!")

    # Prompt user for input
    sentences = []
    while True:
        sentence = input("Enter a sentence to analyze (or type 'exit' to finish): ").strip()
        if sentence.lower() == 'exit':
            break
        if sentence:  # skip empty lines so the model never receives blank input
            sentences.append(sentence)

    # Validate input
    if not sentences:
        print("Error: No sentences provided.")
        return

    # Analyze sentiment
    results = sentiment_pipeline(sentences)

    # Display results
    for i, result in enumerate(results):
        sentiment = result['label']
        score = result['score']
        print(f"Sentence: '{sentences[i]}' | Sentiment: {sentiment} | Score: {score:.4f}")

if __name__ == "__main__":
    main()

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Welcome to the Sentiment Analyzer!
Enter a sentence to analyze (or type 'exit' to finish): I like to eat chips
Enter a sentence to analyze (or type 'exit' to finish): I dont like to eat chips
Enter a sentence to analyze (or type 'exit' to finish): exit
Sentence: 'I like to eat chips' | Sentiment: POSITIVE | Score: 0.9990
Sentence: 'I dont like to eat chips' | Sentiment: NEGATIVE | Score: 0.6125
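
The default model used above only returns POSITIVE or NEGATIVE, while the task also asks for a neutral category. One simple workaround, sketched below, treats low-confidence predictions as neutral. The function name `to_three_way` and the 0.75 threshold are assumptions for illustration; low confidence is not the same thing as neutral sentiment, so a model trained with an explicit neutral class would be the more principled choice.

```python
def to_three_way(label: str, score: float, threshold: float = 0.75) -> str:
    """Map a binary prediction to positive / negative / neutral."""
    if score < threshold:
        return "neutral"  # the model is not confident either way
    return label.lower()


# Label and score pairs taken from the run shown above.
print(to_three_way("POSITIVE", 0.9990))  # positive
print(to_three_way("NEGATIVE", 0.6125))  # neutral
```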


# **Question 8**

**Question 8: You have a corpus of news articles containing information about different companies
and their locations. Use HuggingFace model, develop a pipeline that extracts the names and locations
of these companies from the text.**

In [3]:
from functools import lru_cache
from transformers import pipeline

# Model name taken from the default that the pipeline reports when none is supplied
NER_MODEL = "dbmdz/bert-large-cased-finetuned-conll03-english"

@lru_cache(maxsize=1)
def get_ner_pipeline():
    """Load the NER pipeline once and reuse it for every article."""
    return pipeline("ner", model=NER_MODEL, aggregation_strategy="simple")

def extract_entities(text):
    """Extract company names and locations from the given text using NER."""
    # Perform NER on the input text
    entities = get_ner_pipeline()(text)

    # Filter for company names and locations
    companies = []
    locations = []
    for entity in entities:
        if entity['entity_group'] == 'ORG':  # Organization (company names)
            companies.append(entity['word'])
        elif entity['entity_group'] == 'LOC':  # Location
            locations.append(entity['word'])

    return companies, locations

def main():
    print("Welcome to the Company and Location Extractor!")

    # Example corpus of news articles
    corpus = [
        "Apple Inc. is planning to open a new office in San Francisco.",
        "Tesla, based in Palo Alto, has announced a new electric vehicle.",
        "Amazon is expanding its operations in Seattle.",
        "Microsoft has its headquarters in Redmond, Washington."
    ]

    # Process each article in the corpus
    for article in corpus:
        print(f"\nProcessing article: '{article}'")
        companies, locations = extract_entities(article)

        # Display the results
        print(f"Extracted Companies: {companies}")
        print(f"Extracted Locations: {locations}")

if __name__ == "__main__":
    main()

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Welcome to the Company and Location Extractor!

Processing article: 'Apple Inc. is planning to open a new office in San Francisco.'


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).




Extracted Companies: ['Apple Inc']
Extracted Locations: ['San Francisco']

Processing article: 'Tesla, based in Palo Alto, has announced a new electric vehicle.'




Extracted Companies: ['Tesla']
Extracted Locations: ['Palo Alto']

Processing article: 'Amazon is expanding its operations in Seattle.'




Extracted Companies: ['Amazon']
Extracted Locations: ['Seattle']

Processing article: 'Microsoft has its headquarters in Redmond, Washington.'




Extracted Companies: ['Microsoft']
Extracted Locations: ['Redmond', 'Washington']
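
In the last result, "Redmond" and "Washington" come back as two separate locations even though the article names a single place. The sketch below shows one possible post-processing step that merges neighbouring entities of the same type when only a comma and a space separate them. `merge_adjacent` is a hypothetical helper, and the entity dictionaries are written by hand in the shape the aggregated NER pipeline returns (`entity_group`, `word`, `start`, `end`).

```python
def merge_adjacent(entities, text, group="LOC", joiner=", "):
    """Merge consecutive `group` entities when only `joiner` separates them."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and ent["entity_group"] == group
            and prev["entity_group"] == group
            and text[prev["end"]:ent["start"]] == joiner
        ):
            # Extend the previous entity to cover both spans.
            prev["end"] = ent["end"]
            prev["word"] = text[prev["start"]:prev["end"]]
        else:
            merged.append(dict(ent))  # copy so the input is not mutated
    return merged


text = "Microsoft has its headquarters in Redmond, Washington."
start_r = text.index("Redmond")
start_w = text.index("Washington")
entities = [
    {"entity_group": "ORG", "word": "Microsoft", "start": 0, "end": 9},
    {"entity_group": "LOC", "word": "Redmond", "start": start_r, "end": start_r + len("Redmond")},
    {"entity_group": "LOC", "word": "Washington", "start": start_w, "end": start_w + len("Washington")},
]
print([e["word"] for e in merge_adjacent(entities, text)])
# ['Microsoft', 'Redmond, Washington']
```

This heuristic would also join genuinely separate places written as a list, such as "Paris, London", so it suits "city, state" style text best.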


# **Question 9**

**Question 9: Create a text generator that produces creative text continuations based on your input
prompts.**

Hint: Select a pre-trained model suited for text generation from the Hugging Face Hub.

In [5]:
from transformers import pipeline

def main():
    # Load the text generation pipeline with a pre-trained model
    generator = pipeline('text-generation', model='gpt2')

    print("Welcome to the Creative Text Generator!")

    # Prompt user for input
    prompt = input("Please enter a prompt for text generation: ").strip()

    # Validate input
    if not prompt:
        print("Error: You must provide a prompt.")
        return

    # Generate text continuation
    generated_text = generator(prompt, max_length=100, num_return_sequences=1)[0]['generated_text']

    # Display the generated text
    print("\nGenerated Text:")
    print(generated_text)

if __name__ == "__main__":
    main()

Welcome to the Creative Text Generator!
Please enter a prompt for text generation: In a world dominated by AI


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Generated Text:
In a world dominated by AI, people who can't find the information, find how to find the information, and not believe in science to do it – I don't know that anybody really does them justice, because we've been taught that those people are in some sort of crisis of faith," he said. "They don't really believe it."

'A lot depends on how the facts are presented in the scientific literature. In other words, if you want to be a person, you
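
Because generation stops at a fixed length, the output above ends in the middle of a sentence. A small post-processing step can tidy this up. The sketch below trims the text back to the last complete sentence using only the standard library; `trim_to_last_sentence` is a hypothetical helper, and the punctuation rule is deliberately simple.

```python
import re

# A sentence end: ., ! or ?, optional closing quote or bracket, then whitespace or end of text.
_SENTENCE_END = re.compile(r"""[.!?]["')\]]*(?=\s|$)""")


def trim_to_last_sentence(text: str) -> str:
    """Drop a trailing fragment that does not finish a sentence."""
    matches = list(_SENTENCE_END.finditer(text))
    if not matches:
        return text  # no sentence boundary found: leave the text unchanged
    return text[: matches[-1].end()]


print(trim_to_last_sentence("Hello there. How are you? I am fi"))
# Hello there. How are you?
```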


# **Question 10**

**Question 10: Use any one Hugging Face model in your project. Write Python code that uses the Hugging Face tokenizers library to tokenize a given sentence using the BERT tokenizer.**

In [6]:
from transformers import BertTokenizer

def main():
    # Load the BERT tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # Prompt user for input
    sentence = input("Please enter a sentence to tokenize: ").strip()

    # Validate input
    if not sentence:
        print("Error: You must provide a sentence.")
        return

    # Tokenize the input sentence
    tokens = tokenizer.tokenize(sentence)

    # Encode the sentence to input IDs, adding the special [CLS] and [SEP] tokens
    input_ids = tokenizer.encode(sentence, add_special_tokens=True)

    # Display the results
    print("\nOriginal Sentence:")
    print(sentence)
    print("\nTokens:")
    print(tokens)
    print("\nInput IDs:")
    print(input_ids)

if __name__ == "__main__":
    main()

Please enter a sentence to tokenize: I like cats

Original Sentence:
I like cats

Tokens:
['i', 'like', 'cats']

Input IDs:
[101, 1045, 2066, 8870, 102]
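
In the IDs above, 101 and 102 are the special `[CLS]` and `[SEP]` tokens that `encode` adds around the sentence. BERT's tokenizer splits words it does not know as a whole into sub-word pieces using WordPiece, a greedy longest-match-first procedure. The sketch below illustrates that procedure on a tiny invented vocabulary; `TOY_VOCAB` is hypothetical, so its splits differ from the real `bert-base-uncased` vocabulary, which contains "cats" as a single token.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word maps to unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary for illustration only.
TOY_VOCAB = {"i", "like", "cat", "##s", "play", "##ing"}

print(wordpiece("cats", TOY_VOCAB))     # ['cat', '##s']
print(wordpiece("playing", TOY_VOCAB))  # ['play', '##ing']
print(wordpiece("dog", TOY_VOCAB))      # ['[UNK]']
```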
