**Problem Statement 1: Natural Language Processing (NLP)**

In [None]:
import nltk
import spacy
from nltk.corpus import stopwords
from string import punctuation

In [None]:
# Download the NLTK stop-word list and build a set for fast lookups
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Load the small English spaCy model
nlp = spacy.load('en_core_web_sm')

In [None]:
# Preprocess and tokenize input text.
def preprocess_and_tokenize(text):
    # Check for non-string inputs
    if not isinstance(text, str):
        raise ValueError("Input must be a string")
    text = text.lower()
    # Remove punctuation
    text = ''.join(char for char in text if char not in punctuation)
    # Tokenize using spaCy
    doc = nlp(text)
    # Lemmatize, dropping stop words and non-alphabetic tokens
    tokens = [token.lemma_ for token in doc if token.text not in stop_words and token.is_alpha]

    return tokens

In [None]:
# Example: run the preprocessing function on a short passage
text = "A computer is a machine that can store and process information. Most computers rely on a binary system, which uses two variables, 0 and 1, to complete tasks such as storing data, calculating algorithms, and displaying information."
tokens = preprocess_and_tokenize(text)
print(tokens)

['computer', 'machine', 'store', 'process', 'information', 'computer', 'rely', 'binary', 'system', 'use', 'two', 'variable', 'complete', 'task', 'store', 'datum', 'calculate', 'algorithm', 'display', 'information']
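
The order of the steps above (lowercase, strip punctuation, split into tokens, filter) can be seen in isolation with a library-free sketch. This is only an approximation under stated assumptions: `simple_preprocess` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word list is a tiny stand-in for NLTK's, whitespace splitting replaces spaCy's tokenizer, and there is no lemmatization, so plurals such as "variables" are kept as they are.

```python
from string import punctuation

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook itself uses NLTK's full English list.
DEMO_STOP_WORDS = {"a", "is", "that", "can", "and", "the", "on", "which"}

def simple_preprocess(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip punctuation, split on whitespace, then filter tokens."""
    if not isinstance(text, str):
        raise ValueError("Input must be a string")
    lowered = text.lower()
    # Delete every ASCII punctuation character in one pass
    no_punct = lowered.translate(str.maketrans("", "", punctuation))
    words = no_punct.split()
    # Keep alphabetic tokens that are not stop words (no lemmatization here)
    return [w for w in words if w.isalpha() and w not in stop_words]

print(simple_preprocess("A computer is a machine that can store and process information."))
# ['computer', 'machine', 'store', 'process', 'information']
```

Because punctuation is removed before tokenization, contractions such as "don't" collapse into "dont"; the same holds for the spaCy-based function above.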


**Problem Statement 2: Text Generation**

In [None]:
# Import the pipeline function from the Hugging Face transformers library
from transformers import pipeline
# Create a text generation pipeline using a pre-trained model 'distilgpt2'
# 'distilgpt2' is a smaller and faster variant of GPT-2
generator = pipeline("text-generation", model="distilgpt2")
# Use the text generator to produce two sequences of text based on the provided prompt
generator("I will like to", max_length=50, num_return_sequences=2)


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I will like to thank the guys we have come through and we are always impressed with them to a large extent.'},
 {'generated_text': 'I will like to express my opinion that when the Supreme Court is set through the courts we are able to get the justice who will lead the whole nation in this Court.'}]
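
The two generated sequences differ from each other, which is consistent with the pipeline sampling each next token from a probability distribution instead of always taking the single most likely one. A minimal, library-free sketch of that idea follows. The scores in `toy_logits` are made up and the `sample_next` helper is hypothetical; neither is part of the transformers library or taken from distilgpt2.

```python
import math
import random

def sample_next(logits, temperature=1.0, rng=random):
    """Draw one token from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    tokens = list(logits)
    scaled = [logits[tok] / temperature for tok in tokens]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores for the token that follows "I will like to"
toy_logits = {"thank": 2.0, "express": 1.5, "go": 1.0}

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
print([sample_next(toy_logits, temperature=1.0, rng=rng) for _ in range(5)])
print([sample_next(toy_logits, temperature=0.01, rng=rng) for _ in range(5)])
```

Lower temperatures concentrate the probability on the highest-scoring token, so the second list contains only `thank`, while the draws at temperature 1.0 vary with the seed.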

**Problem Statement 3: Prompt Engineering**

In [None]:
from transformers import pipeline
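# Note: load_metric is deprecated in newer datasets releases in favor of the separate evaluate library
from datasets import load_metric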

# Load the QA model and ROUGE metric
qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
rouge = load_metric("rouge")

# Context passage about machine learning
context = """
Machine learning is a field of artificial intelligence that focuses on the development of algorithms that enable computers to learn from and make predictions based on data. It encompasses a range of techniques, including supervised learning, unsupervised learning, and reinforcement learning. Supervised learning involves training a model on labeled data, while unsupervised learning involves finding hidden patterns in unlabeled data. Reinforcement learning is based on the idea of learning through trial and error, where an agent learns to make decisions by receiving rewards or penalties. Machine learning has applications in various domains such as healthcare, finance, and autonomous vehicles.
"""

questions = [
    "What are the main techniques used in machine learning?",
    "How does supervised learning differ from unsupervised learning?",
    "What is the purpose of reinforcement learning?"
]

# Different prompt designs
prompts = [
    "Based on the context, answer the following question:",
    "Using the provided context, respond to this question:",
    "Given the context, what is the answer to the question below?"
]

# Generate answers for each prompt
answers = []
for question in questions:
    for prompt in prompts:
        full_prompt = prompt + "\n" + question
        result = qa_pipeline(question=full_prompt, context=context)
        answer = result['answer']
        answers.append((question, prompt, answer))
        print(f"Question: {question}\nPrompt: {prompt}\nAnswer: {answer}\n")

# Reference answers for evaluation (using known correct answers)
reference_answers = [
    "Machine learning techniques include supervised learning, unsupervised learning, and reinforcement learning.",
    "Supervised learning involves training on labeled data, while unsupervised learning involves finding patterns in unlabeled data.",
    "Reinforcement learning involves learning through trial and error by receiving rewards or penalties."
]

# Prepare data for evaluation. answers holds one entry per (question, prompt)
# pair, so each reference is repeated once per prompt to stay aligned.
predictions = [a[2] for a in answers]
references = [reference for reference in reference_answers for _ in prompts]

# Evaluate each prediction against the reference answer for its own question
rouge_scores = [rouge.compute(predictions=[pred], references=[ref]) for pred, ref in zip(predictions, references)]

for (question, prompt, _), score in zip(answers, rouge_scores):
    print(f"ROUGE scores for question: {question}")
    print(f"Prompt: {prompt}")
    print(f"ROUGE-1: {score['rouge1'].mid.fmeasure:.4f}")
    print(f"ROUGE-2: {score['rouge2'].mid.fmeasure:.4f}")
    print(f"ROUGE-L: {score['rougeL'].mid.fmeasure:.4f}")


Question: What are the main techniques used in machine learning?
Prompt: Based on the context, answer the following question:
Answer: supervised learning, unsupervised learning, and reinforcement learning

Question: What are the main techniques used in machine learning?
Prompt: Using the provided context, respond to this question:
Answer: supervised learning, unsupervised learning, and reinforcement learning

Question: What are the main techniques used in machine learning?
Prompt: Given the context, what is the answer to the question below?
Answer: supervised learning, unsupervised learning, and reinforcement learning

Question: How does supervised learning differ from unsupervised learning?
Prompt: Based on the context, answer the following question:
Answer: finding hidden patterns in unlabeled data

Question: How does supervised learning differ from unsupervised learning?
Prompt: Using the provided context, respond to this question:
Answer: finding hidden patterns in unlabeled data
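
To make the ROUGE-1 score used in the evaluation easier to interpret, here is a simplified sketch of its F-measure as clipped unigram overlap between a prediction and a reference. It is an approximation only: `rouge1_f` is a hypothetical helper, the regex tokenization is an assumption, and there is no stemming, so its values may differ from what the library metric reports.

```python
import re
from collections import Counter

def rouge1_f(prediction, reference):
    """Simplified ROUGE-1 F-measure based on clipped unigram overlap."""
    pred_tokens = re.findall(r"[a-z0-9]+", prediction.lower())
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

prediction = "supervised learning, unsupervised learning, and reinforcement learning"
reference = ("Machine learning techniques include supervised learning, "
             "unsupervised learning, and reinforcement learning.")
print(f"{rouge1_f(prediction, reference):.4f}")  # 0.7778
```

All seven predicted tokens appear in the eleven-token reference, so precision is 1 and recall is 7/11. This illustrates why a short extractive answer can be fully correct and still score below 1 against a full-sentence reference.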

