In [1]:
!pip install datasets



In [2]:
from datasets import load_dataset

ds = load_dataset("toughdata/quora-question-answer-dataset")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [3]:
print(ds)
ds["train"][2]

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 56402
    })
})


{'question': 'What song has the lyrics "someone left the cake out in the rain"?',
 'answer': "MacArthur's Park\n"}

In [4]:
ds["train"].features

{'question': Value(dtype='string', id=None),
 'answer': Value(dtype='string', id=None)}

In [5]:
import pandas as pd


# Convert to Pandas DataFrame
df_train = ds['train'].to_pandas()

# Save as CSV
df_train.to_csv("quora_question_answer_train.csv", index=False)


In [6]:
df_train.head()

Unnamed: 0,question,answer
0,Why whenever I get in the shower my girlfriend...,Isn’t it awful? You would swear that there was...
1,"What is a proxy, and how can I use one?",A proxy server is a system or router that prov...
2,"What song has the lyrics ""someone left the cak...",MacArthur's Park\n
3,I am the owner of an adult website called http...,Don't let apps that are liers put adds on your...
4,Does the Bible mention anything about a place ...,St. John in the book of Revelation mentions an...


In [7]:
df_train.shape

(56402, 2)

In [8]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56402 entries, 0 to 56401
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   question  56402 non-null  object
 1   answer    56402 non-null  object
dtypes: object(2)
memory usage: 881.4+ KB


In [9]:
df_train.isnull().sum()

question    0
answer      0
dtype: int64

In [10]:
df_train.duplicated().sum()

1220

In [11]:
df_train1 = df_train.drop_duplicates()

In [12]:
df_train1

Unnamed: 0,question,answer
0,Why whenever I get in the shower my girlfriend...,Isn’t it awful? You would swear that there was...
1,"What is a proxy, and how can I use one?",A proxy server is a system or router that prov...
2,"What song has the lyrics ""someone left the cak...",MacArthur's Park\n
3,I am the owner of an adult website called http...,Don't let apps that are liers put adds on your...
4,Does the Bible mention anything about a place ...,St. John in the book of Revelation mentions an...
...,...,...
56397,"Alexandria Ocasio-Cortez said ""Going by track ...","I think she’s right, one is a homosexual with ..."
56398,Is becoming a doctor financially worth it?,Yes if you want to help people and eliminate p...
56399,Where can one find the best biryani in bangalore?,Biryani crafts.These guys will give proper aut...
56400,Which smartphone is best for middle class people?,Oneplus nord\n[LINKED_TEXT: https://latesttech...


In [13]:
df_train1.shape

(55182, 2)

In [14]:
duplicates = df_train[df_train.duplicated()]
duplicates.head(10)

Unnamed: 0,question,answer
1323,Do vocal exercises really help your singing be...,Click here to know some basic singing tips\n [...
1646,x^{2}+y^{2}=2,1\n
2148,"What is a proxy, and how can I use one?",Proxy server basically acts as an intermediary...
2511,Why are Tempurpedic beds so expensive?,In terms of brand reliability - Sleepwell >= S...
2909,I forgot my Apple ID password how can I reset it?,"I forgot my Apple ID and Password, what should..."
2995,"Which Bollywood movie has a very effective, we...",For me the answer to this question is Imtiaz A...
3062,Why are western countries including the United...,"All of the above together. Yep, paradox. But n..."
3172,Whats the #1 thing you always pray for?,I can tell you about some of the most famous a...
3299,What is ताजमहल किस नदी के किनारे स्थित हैं?,Yamuna\n
3312,Why are western countries including the United...,"All of the above together. Yep, paradox. But n..."


In [15]:
df_train1.isnull().sum()

question    0
answer      0
dtype: int64

In [16]:
df_train1['answer'].duplicated().sum()

456

In [17]:
df_train1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 55182 entries, 0 to 56401
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   question  55182 non-null  object
 1   answer    55182 non-null  object
dtypes: object(2)
memory usage: 1.3+ MB


In [18]:
df_train1.describe()

Unnamed: 0,question,answer
count,55182,55182
unique,3234,54726
top,Would Hillary Clinton have made a better Presi...,Yes\n
freq,106,56


In [19]:
df_train1['question'].nunique()

3234

In [20]:
df_train1.nunique()

question     3234
answer      54726
dtype: int64
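
With 3,234 unique questions across 55,182 rows, each question has about 17 answers on average, so the data is better read as a one-to-many mapping than as independent pairs. A minimal sketch of that grouping using only the standard library; `group_answers` and the toy rows are illustrative, and in this notebook the input would be `zip(df_train1['question'], df_train1['answer'])`:

```python
from collections import defaultdict


def group_answers(pairs):
    """Map each question to the list of its answers, in first-seen order."""
    grouped = defaultdict(list)
    for question, answer in pairs:
        grouped[question].append(answer)
    return dict(grouped)


# Toy rows (illustrative only, not taken from the dataset).
pairs = [
    ("What is a proxy?", "An intermediary server."),
    ("Is tea healthy?", "In moderation."),
    ("What is a proxy?", "A gateway between users and the internet."),
]
grouped = group_answers(pairs)
print({q: len(a) for q, a in grouped.items()})
```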

In [21]:
df_train1.head()

Unnamed: 0,question,answer
0,Why whenever I get in the shower my girlfriend...,Isn’t it awful? You would swear that there was...
1,"What is a proxy, and how can I use one?",A proxy server is a system or router that prov...
2,"What song has the lyrics ""someone left the cak...",MacArthur's Park\n
3,I am the owner of an adult website called http...,Don't let apps that are liers put adds on your...
4,Does the Bible mention anything about a place ...,St. John in the book of Revelation mentions an...


In [22]:
df_train1.tail()

Unnamed: 0,question,answer
56397,"Alexandria Ocasio-Cortez said ""Going by track ...","I think she’s right, one is a homosexual with ..."
56398,Is becoming a doctor financially worth it?,Yes if you want to help people and eliminate p...
56399,Where can one find the best biryani in bangalore?,Biryani crafts.These guys will give proper aut...
56400,Which smartphone is best for middle class people?,Oneplus nord\n[LINKED_TEXT: https://latesttech...
56401,Why am I always rejected by the men I am inter...,"I am a man, just not competitively and I will ..."


In [23]:
df_train1.iloc[0]['question']

'Why whenever I get in the shower my girlfriend want to join?'

In [24]:
import torch

In [25]:
from transformers import BertTokenizer, BertForQuestionAnswering


tokenizer = BertTokenizer.from_pretrained("deepset/bert-base-cased-squad2")
model = BertForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")

question, text = "What is a proxy server?", "A proxy server is a system or router that provides a gateway between users and the internet."

inputs = tokenizer(question, text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

Some weights of the model checkpoint at deepset/bert-base-cased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


'a system or router'

In [26]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, 

In [27]:
questions = df_train1['question'].tolist()
answers = df_train1['answer'].tolist()

In [28]:
from transformers import BertTokenizer, BertForQuestionAnswering


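# Load the larger SQuAD-fine-tuned BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
# This replaces the model that was moved to `device` earlier, so move the new one too
model = model.to(device).eval()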


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [29]:
import re
def normalize_punctuation(text):
    # Drop every character except word characters, whitespace, and . ? ! "
    # (this also strips apostrophes, commas, hyphens, colons, semicolons and curly quotes)
    text = re.sub(r'[^\w\s\.?!"]', '', text)
    # Remove whitespace before sentence punctuation
    text = re.sub(r'\s([?.!;:])', r'\1', text)

    # Put a space on both sides of sentence punctuation
    text = re.sub(r'([?.!;:])', r' \1 ', text)

    # Normalize quotes and dashes. The first substitution has already removed
    # these characters, so the four lines below currently have no effect.
    text = re.sub(r'“|”', '"', text)  # Replace curly double quotes with straight double quotes
    text = re.sub(r'‘|’', "'", text)  # Replace curly single quotes with straight single quotes
    text = re.sub(r'–', '-', text)    # Replace en dash with hyphen
    text = re.sub(r'—', '-', text)    # Replace em dash with hyphen

    # Collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space

    # Remove leading and trailing spaces
    text = text.strip()

    return text

In [30]:
def chunk_context(context, max_length=512, overlap=50):
    tokens = tokenizer.tokenize(context)
    chunks = []
    for i in range(0, len(tokens), max_length - overlap):
        chunk = tokens[i:i + max_length]
        chunks.append(tokenizer.convert_tokens_to_string(chunk))
    return chunks
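
The chunker above allows each chunk up to `max_length` tokens, but the encoded input also has to hold the question and the special tokens, so a full 512-token chunk is truncated again at encoding time. A minimal sketch, on plain token lists, of a window that reserves that room. `chunk_tokens`, `question_len` and `num_special` are illustrative names, not part of the notebook, and `num_special=3` assumes a BERT-style pair input:

```python
def chunk_tokens(tokens, question_len, max_length=512, overlap=50, num_special=3):
    """Split `tokens` into overlapping windows that leave room for the question.

    num_special=3 assumes the layout [CLS] question [SEP] context [SEP].
    """
    window = max_length - question_len - num_special
    if window <= overlap:
        raise ValueError("question too long for the chosen max_length/overlap")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the context
    return chunks


tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens, question_len=12)
print(len(chunks), [len(c) for c in chunks])
```

Stopping once a window reaches the end avoids emitting a trailing chunk that only repeats overlap tokens.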

In [31]:
def find_best_answer(question, context):
    context_chunks = chunk_context(context)
    best_answer = ""
    best_score = float('-inf')

    for chunk in context_chunks:
        inputs = tokenizer.encode_plus(question, chunk, return_tensors='pt', max_length=512, truncation=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = model(**inputs)
        answer_start_scores = outputs.start_logits
        answer_end_scores = outputs.end_logits

        answer_start = torch.argmax(answer_start_scores)
        answer_end = torch.argmax(answer_end_scores) + 1

        score = answer_start_scores[0][answer_start].item() + answer_end_scores[0][answer_end - 1].item()

        if score > best_score:
            best_score = score
            best_answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end]))

    if best_answer.strip() == "":
        best_answer = "[CLS]"

    return best_answer
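
Taking the two argmaxes independently, as above, can place the end index before the start index; the slice then decodes to an empty string and the "[CLS]" fallback is returned. A sketch of a joint search over valid spans, shown on plain lists of logits. `best_span` and `max_answer_len` are hypothetical names, and the default of 30 tokens is an assumption that would need raising for long answers; in the function above the inputs would be `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`:

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximising start_logits[s] + end_logits[e]
    subject to s <= e < s + max_answer_len. Both indices are inclusive."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Independent argmaxes would pick start=3 and end=1 here (an empty span).
start = [0.1, 0.2, 0.0, 5.0, 0.3]
end = [0.0, 4.0, 0.1, 1.0, 3.5]
print(best_span(start, end))
```

This sketch does not exclude positions that belong to the question; a fuller version would restrict the search to context tokens.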

In [32]:
import logging
from transformers import logging as transformers_logging

# Set the logging level to ERROR to suppress warnings
logging.basicConfig(level=logging.ERROR)
transformers_logging.set_verbosity_error()


In [33]:
def process_question(args):
    # max_length and overlap are unpacked but not forwarded; chunk_context uses its own defaults
    question, context, max_length, overlap = args
    normalized_question = normalize_punctuation(question)
    context = normalize_punctuation(context)
    best_answer = find_best_answer(normalized_question, context)
    return best_answer

In [34]:
import multiprocessing
from tqdm import tqdm

def parallel_processing(questions, contexts, max_length=512, overlap=50, num_processes=4):
    args = [(question, context, max_length, overlap) for question, context in zip(questions, contexts)]
    with multiprocessing.Pool(processes=num_processes) as pool:
        results = list(tqdm(pool.imap(process_question, args), total=len(questions)))
    return results
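
`pool.imap` returns results in input order, which is what lets the later cells zip questions with answers. A sketch of the same ordered-map contract on a thread pool, which avoids pickling arguments and results between processes; `threaded_map` and `shout` are illustrative names, and whether threads are faster for this workload is not something the notebook measures:

```python
from concurrent.futures import ThreadPoolExecutor


def threaded_map(func, items, num_workers=4):
    """Apply `func` to each item on a thread pool, returning results in input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(func, items))


def shout(args):
    # Stand-in for process_question: unpack a tuple and return a string.
    question, context = args
    return f"{question} -> {context.upper()}"


results = threaded_map(shout, [("q1", "a"), ("q2", "b"), ("q3", "c")])
print(results)
```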

In [35]:
# Disabled alternative preprocessing, kept for reference:
# def preprocess_text(text):
#     # Remove punctuation (except for periods and question marks)
#     text = re.sub(r'[^\w\s\.?]', '', text)
#     return text

In [36]:
import multiprocessing
from tqdm import tqdm

# multiprocessing.set_start_method('spawn', force=True)  # left disabled
subset_size = 10  # Adjust this as needed for testing
questions_subset = questions[:subset_size]
contexts_subset = answers[:subset_size]


answers_results = parallel_processing(questions_subset, contexts_subset)

# Print results for testing
for question, answer in zip(questions_subset, answers_results):
    print("Question:", question)
    print("Best Answer:", answer)



100%|██████████| 10/10 [00:12<00:00,  1.26s/it]


Question: Why whenever I get in the shower my girlfriend want to join?
Best Answer: hot water
Question: What is a proxy, and how can I use one?
Best Answer: a system or router that provides a gateway between users and the internet . therefore it helps prevent cyber attackers from entering a private network . it is a server referred to as an intermediary because it goes between endusers and the web pages they visit online . when a computer connects to the internet it uses an ip address . this is similar to your homes street address telling incoming data where to go and marking outgoing data with a return address for other devices to authenticate . a proxy server is essentially a computer on the internet that has an ip address of its own
Question: What song has the lyrics "someone left the cake out in the rain"?
Best Answer: macarthurs park
Question: I am the owner of an adult website called https://matureanallovers.com. Can anyone offer any SEO tips to help improve my SERP ranking on Go

In [37]:
for question, context in zip(questions_subset, contexts_subset):
    print("Question:", question)
    context_chunks = chunk_context(context)
    for i, chunk in enumerate(context_chunks):
        print(f"Context Chunk {i+1}:", chunk)

Question: Why whenever I get in the shower my girlfriend want to join?
Context Chunk 1: isn ’ t it awful ? you would swear that there wasn ’ t enough hot water to go around !
Question: What is a proxy, and how can I use one?
Context Chunk 1: a proxy server is a system or router that provides a gateway between users and the internet . therefore , it helps prevent cyber attackers from entering a private network . it is a server , referred to as an “ intermediary ” because it goes between end - users and the web pages they visit online . when a computer connects to the internet , it uses an ip address . this is similar to your home ’ s street address , telling incoming data where to go and marking outgoing data with a return address for other devices to authenticate . a proxy server is essentially a computer on the internet that has an ip address of its own . how a proxy works because a proxy server has its own ip address , it acts as a go - between for a computer and the internet . your 

In [38]:
import pandas as pd
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [39]:

# Load pre-trained T5 model and tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-large')
model = T5ForConditionalGeneration.from_pretrained('t5-large')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

questions = df_train1['question'].tolist()
answers = df_train1['answer'].tolist()


In [40]:
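# Load T5 model and tokenizer by name
model_name = "t5-large"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
# The reload replaces the model moved to `device` above, so move the new one too
model = model.to(device).eval()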


In [41]:

import re
def normalize_punctuation(text):
    # Drop every character except word characters, whitespace, and . ? ! "
    # (this also strips apostrophes, commas, hyphens, colons, semicolons and curly quotes)
    text = re.sub(r'[^\w\s\.?!"]', '', text)
    # Remove whitespace before sentence punctuation
    text = re.sub(r'\s([?.!;:])', r'\1', text)

    # Put a space on both sides of sentence punctuation
    text = re.sub(r'([?.!;:])', r' \1 ', text)

    # Normalize quotes and dashes. The first substitution has already removed
    # these characters, so the four lines below currently have no effect.
    text = re.sub(r'“|”', '"', text)  # Replace curly double quotes with straight double quotes
    text = re.sub(r'‘|’', "'", text)  # Replace curly single quotes with straight single quotes
    text = re.sub(r'–', '-', text)    # Replace en dash with hyphen
    text = re.sub(r'—', '-', text)    # Replace em dash with hyphen

    # Collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space

    # Remove leading and trailing spaces
    text = text.strip()

    return text

def chunk_context(context, max_length=512, overlap=50):
    tokens = tokenizer.tokenize(context)
    chunks = []
    for i in range(0, len(tokens), max_length - overlap):
        chunk = tokens[i:i + max_length]
        chunks.append(tokenizer.convert_tokens_to_string(chunk))
    return chunks

def find_best_answer(question, context):
    context_chunks = chunk_context(context)
    best_answer = ""
    best_score = float('-inf')

    for chunk in context_chunks:
        input_text = f'question: {question} context: {chunk}'
        inputs = tokenizer(input_text, return_tensors='pt', max_length=512, truncation=True).to(device)

        with torch.no_grad():
            outputs = model.generate(inputs['input_ids'], max_length=512, num_beams=5, early_stopping=True)

        answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Simple scoring heuristic: the answer with the most words wins
        score = len(answer.split())  # Word count as a proxy for answer quality

        if score > best_score:
            best_score = score
            best_answer = answer

    if best_answer.strip() == "":
        best_answer = "[CLS]"

    return best_answer

In [42]:


# Set the logging level to ERROR to suppress warnings
logging.basicConfig(level=logging.ERROR)
transformers_logging.set_verbosity_error()

def process_question(args):
    question, context, max_length, overlap = args
    normalized_question = normalize_punctuation(question)
    context = normalize_punctuation(context)
    best_answer = find_best_answer(normalized_question, context)
    return best_answer

def parallel_processing(questions, contexts, max_length=512, overlap=50, num_processes=4):
    args = [(question, context, max_length, overlap) for question, context in zip(questions, contexts)]
    with multiprocessing.Pool(processes=num_processes) as pool:
        results = list(tqdm(pool.imap(process_question, args), total=len(questions)))
    return results

In [43]:
subset_size = 20  # Adjust this as needed for testing
questions_subset = questions[:subset_size]
contexts_subset = answers[:subset_size]

answers_results = parallel_processing(questions_subset, contexts_subset)

# Print results for testing
for question, answer in zip(questions_subset, answers_results):
    print("Question:", question)
    print("Best Answer:", answer)

100%|██████████| 20/20 [02:20<00:00,  7.02s/it]


Question: Why whenever I get in the shower my girlfriend want to join?
Best Answer: there wasnt enough hot water to go around
Question: What is a proxy, and how can I use one?
Best Answer: a computer on the internet that has an IP address of its own
Question: What song has the lyrics "someone left the cake out in the rain"?
Best Answer: MacArthurs Park
Question: I am the owner of an adult website called https://matureanallovers.com. Can anyone offer any SEO tips to help improve my SERP ranking on Google?
Best Answer: mature anal lovers
Question: Does the Bible mention anything about a place "between" heaven and hell?
Best Answer: Revelation
Question: What are useful free and open-source tools for devops and sysadmin folks?
Best Answer: vim git puppet fabric and zabbix
Question: The justice department has told the state of Missouri that they can’t void federal gun laws. What would the justice department seek to do if the state went ahead with their plans anyways? Have many states pushed

In [44]:
!pip install openai


Collecting openai
  Downloading openai-1.37.1-py3-none-any.whl.metadata (22 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl.metadata (7.2 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.5-py3-none-any.whl.metadata (20 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading openai-1.37.1-py3-none-any.whl (337 kB)
Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Downloading h11-0.14.0-py3-none-a

In [45]:
import openai
import os

# Read the OpenAI API key from the environment rather than hard-coding it
openai.api_key = os.environ.get("OPENAI_API_KEY")

questions = df_train1['question'].tolist()
answers = df_train1['answer'].tolist()


In [46]:
import re
def normalize_punctuation(text):
    # Drop every character except word characters, whitespace, and . ? ! "
    # (this also strips apostrophes, commas, hyphens, colons, semicolons and curly quotes)
    text = re.sub(r'[^\w\s\.?!"]', '', text)
    # Remove whitespace before sentence punctuation
    text = re.sub(r'\s([?.!;:])', r'\1', text)

    # Put a space on both sides of sentence punctuation
    text = re.sub(r'([?.!;:])', r' \1 ', text)

    # Normalize quotes and dashes. The first substitution has already removed
    # these characters, so the four lines below currently have no effect.
    text = re.sub(r'“|”', '"', text)  # Replace curly double quotes with straight double quotes
    text = re.sub(r'‘|’', "'", text)  # Replace curly single quotes with straight single quotes
    text = re.sub(r'–', '-', text)    # Replace en dash with hyphen
    text = re.sub(r'—', '-', text)    # Replace em dash with hyphen

    # Collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space

    # Remove leading and trailing spaces
    text = text.strip()

    return text

In [47]:
def chunk_context(context, max_length=512, overlap=50):
    words = context.split()
    chunks = []
    for i in range(0, len(words), max_length - overlap):
        chunk = ' '.join(words[i:i + max_length])
        chunks.append(chunk)
    return chunks

def find_best_answer(question, context):
    context_chunks = chunk_context(context)
    best_answer = ""
    best_score = float('-inf')

    for chunk in context_chunks:
        prompt = f"Question: {question}\nContext: {chunk}\nAnswer:"

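        # NOTE: openai.Completion was removed in openai>=1.0; with the 1.37.1
        # package installed above, this call raises APIRemovedInV1.
        response = openai.Completion.create(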
            engine="text-davinci-003",
            prompt=prompt,
            max_tokens=100,
            temperature=0.5
        )

        answer = response.choices[0].text.strip()
        score = len(answer.split())  # Using length as a proxy for scoring

        if score > best_score:
            best_score = score
            best_answer = answer

    if best_answer.strip() == "":
        best_answer = "[CLS]"

    return best_answer

import logging
logging.basicConfig(level=logging.ERROR)

def process_question(args):
    question, context, max_length, overlap = args
    normalized_question = normalize_punctuation(question)
    context = normalize_punctuation(context)
    best_answer = find_best_answer(normalized_question, context)
    return best_answer

def parallel_processing(questions, contexts, max_length=512, overlap=50, num_processes=4):
    args = [(question, context, max_length, overlap) for question, context in zip(questions, contexts)]
    with multiprocessing.Pool(processes=num_processes) as pool:
        results = list(tqdm(pool.imap(process_question, args), total=len(questions)))
    return results

In [48]:
import multiprocessing
from tqdm import tqdm

subset_size = 10  # Adjust this as needed for testing
questions_subset = questions[:subset_size]
contexts_subset = answers[:subset_size]

answers_results = parallel_processing(questions_subset, contexts_subset)

# Print results for testing
for question, answer in zip(questions_subset, answers_results):
    print("Question:", question)
    print("Best Answer:", answer)

  0%|          | 0/10 [00:00<?, ?it/s]
Exception in thread Thread-20 (_handle_results):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 579, in _handle_results
    task = get()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
TypeError: APIRemovedInV1.__init__() takes 1 positional argument but 2 were given
  0%|          | 0/10 [00:15<?, ?it/s]
Process ForkPoolWorker-9:
Process ForkPoolWorker-12:
Process ForkPoolWorker-11:
Process ForkPoolWorker-10:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    se

KeyboardInterrupt: 
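
The `APIRemovedInV1` error in the traceback above comes from calling `openai.Completion.create` with a 1.x release of the package; that interface was removed in version 1.0. A hedged sketch of the same request written against the 1.x client interface, with the client passed in so the function can be exercised without network access. `answer_chunk` is an illustrative name, and the model name `gpt-3.5-turbo-instruct` is an assumption (any completions-capable model available to the account would serve):

```python
from types import SimpleNamespace


def answer_chunk(client, question, chunk, model="gpt-3.5-turbo-instruct"):
    """Ask a completions-style endpoint to answer `question` from `chunk`.

    `client` is expected to behave like `openai.OpenAI()` from openai>=1.0,
    i.e. to expose `client.completions.create(...)`.
    """
    prompt = f"Question: {question}\nContext: {chunk}\nAnswer:"
    response = client.completions.create(
        model=model,
        prompt=prompt,
        max_tokens=100,
        temperature=0.5,
    )
    return response.choices[0].text.strip()


# Stand-in client for a dry run; it only mimics the response shape used above.
def _fake_create(**kwargs):
    return SimpleNamespace(choices=[SimpleNamespace(text=" MacArthur Park \n")])


fake_client = SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
print(answer_chunk(fake_client, "What song has these lyrics?", "MacArthur's Park"))
```

With the real package the call would be `answer_chunk(OpenAI(), question, chunk)` after `from openai import OpenAI`; that client reads the key from the `OPENAI_API_KEY` environment variable by default.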

In [49]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

In [50]:
# Load pre-trained model tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2LMHeadModel.from_pretrained('gpt2-large')



In [51]:
def normalize_punctuation(text):
    import re
    text = re.sub(r'[^\w\s\.?!"]', '', text)
    text = re.sub(r'\s([?.!;:])', r'\1', text)
    text = re.sub(r'([?.!;:])', r' \1 ', text)
    text = re.sub(r'“|”', '"', text)
    text = re.sub(r'‘|’', "'", text)
    text = re.sub(r'–', '-', text)
    text = re.sub(r'—', '-', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text

In [52]:

def generate_answer(question, context):
    prompt = f"Based on the following context, answer the question.\n\nContext: {context}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors='pt', max_length=1024, truncation=True)
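    # NOTE: max_length=200 counts the prompt tokens as well as the generated ones
    outputs = model.generate(inputs['input_ids'], max_length=200, num_return_sequences=1, no_repeat_ngram_size=2)
    # The decoded text contains the prompt followed by the continuation
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)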
    return answer

def process_question(question, context):
    normalized_question = normalize_punctuation(question)
    normalized_context = normalize_punctuation(context)
    return generate_answer(normalized_question, normalized_context)
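
`generate` on a decoder-only model such as GPT-2 returns the prompt token ids followed by the newly generated ones, so decoding `outputs[0]` in full repeats the prompt. A small sketch of isolating the continuation by slicing at the prompt length, shown on plain lists. `split_continuation` is a hypothetical helper; with tensors the equivalent slice would be `outputs[0][inputs['input_ids'].shape[1]:]`:

```python
def split_continuation(prompt_ids, output_ids):
    """Return only the ids generated after the prompt.

    Assumes `output_ids` starts with `prompt_ids`, which is how
    decoder-only generation lays out its output.
    """
    n = len(prompt_ids)
    if output_ids[:n] != prompt_ids:
        raise ValueError("output does not start with the prompt")
    return output_ids[n:]


prompt_ids = [101, 7, 42, 9]
output_ids = [101, 7, 42, 9, 55, 56, 57]
print(split_continuation(prompt_ids, output_ids))  # [55, 56, 57]
```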

In [53]:
questions = [
    "What is a proxy, and how can I use one?",
    "Why are the Kardashians so popular?"
]

contexts = [
    "A proxy is an intermediary server that separates end users from the websites they browse. It provides increased functionality and security.",
    "The Kardashians are popular due to their extensive media presence and reality TV show. Their lifestyle and personal drama attract significant public attention."
]

answers = [process_question(q, c) for q, c in zip(questions, contexts)]

# Print results for testing
for question, answer in zip(questions, answers):
    print("Question:", question)
    print("Generated Answer:", answer)
    print()

Question: What is a proxy, and how can I use one?
Generated Answer: Based on the following context, answer the question.

Context: A proxy is an intermediary server that separates end users from the websites they browse. It provides increased functionality and security.

Question: What is a proxy and how can I use one?
Answer:
. A Proxy is used to separate end user from websites. Aproxy is the proxy that is responsible for the communication between the end-user and the website. The proxy acts as a middleman between enduser, the browser and website. The end result is that the user is not exposed to the content of the webpage. This is why aproxy can be used as an alternative to a browser. It is also used for security purposes. For example, if a user has a malicious website, the web browser will not be able to access the malicious content. This can also be achieved by using a Proxy. Proxy can act as the intermediary between a website and end User. In this case, endUser will be exposed

Qu