# Data

# Introduction

There is an initial draft version in the end as well. 
Although the result was not so great, it served as a foundation for my final version. 
There were a lot of experimenting which I decided not to include here.
Hope my work is clear and readable. 

# Final version

## Code

In [1]:
import torch
from datasets import load_dataset
from transformers.models.bert import BertTokenizer, BertForQuestionAnswering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

datafiles = {"train": "datatrain.csv", "test": "datatest.csv"}
dataset = load_dataset("data", data_files=datafiles, delimiter=";", encoding='cp1252')

def segment_documents(dataset, max_doc_length=450):
    segmented_docs = []
    for i in range(len(dataset)):
        doc_values = list(dataset[i].values())
        segmented_docs.extend([' '.join(doc_values[j:j+max_doc_length]) for j in range(0, len(doc_values), max_doc_length)])
    return segmented_docs

def get_top_k_articles(query, segmented_docs, k=2):
    vectorizer = TfidfVectorizer(analyzer="word", stop_words='english')
    query_and_docs = [query] + segmented_docs
    matrix = vectorizer.fit_transform(query_and_docs)
    scores = [cosine_similarity(matrix[0], matrix[i])[0][0] for i in range(1, len(query_and_docs))]
    sorted_list = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    top_doc_indices = [x[0] for x in sorted_list[:k]]
    top_docs = [segmented_docs[x] for x in top_doc_indices]
    return top_docs

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

def answer_question(question, answer_text):
    input_ids = tokenizer.encode(question, answer_text, max_length=512, return_tensors='pt')
    token_type_ids = torch.zeros_like(input_ids)
    attention_mask = torch.ones_like(input_ids)

    outputs = model(input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask, return_dict=True)

    start_scores = outputs.start_logits
    end_scores = outputs.end_logits

    answer_start = torch.argmax(start_scores)
    answer_end = torch.argmax(end_scores) + 1

    # Convert the tokens back to text
    answer_tokens = tokenizer.convert_ids_to_tokens(input_ids[0][answer_start:answer_end])
    answer = tokenizer.convert_tokens_to_string(answer_tokens)

    print('Answer: "' + answer + '" with ' + f'{torch.max(torch.softmax(start_scores, dim=1)).item()*100:.2f}' + '% confidence' )
    return answer


Found cached dataset csv (C:/Users/Aigerim/.cache/huggingface/datasets/csv/data-726cb969e4b012c5/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


  0%|          | 0/2 [00:00<?, ?it/s]

## Tests

In [2]:
# Enter our query here
import torch 
query = "How many employees in Hyperconnect company?"
#query = "What else does the bassist for Death From Above play?"
#query = "What projects is Jesse Keeler involved in?"

# Segment our documents
segmented_docs = segment_documents(dataset['train'], 450)

# Retrieve the top k most relevant documents to the query
candidate_docs = get_top_k_articles(query, segmented_docs, 3)

# Return the likeliest answers from each of our top k most relevant documents in descending order
for i in candidate_docs:
    answer_question(query, i)
    print ("Reference Document: ", i)
    
print("------------------------------------")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Answer: "employees across the company" with 9.03% confidence
Reference Document:  Our solution helped Hyperconnect improve their marketing performance indicators and reduce employees' time on routine data operations. Now, our client has access to more granular data and makes informed decisions faster. According to Hyperconnect, their marketing team now concentrates on marketing campaign optimization rather than data integration tasks. Furthermore, our ETL system provided employees across the company with easy access to all marketing reports. Sales managers, stakeholders, business analysts, and other competent employees can now analyze the company’s marketing efforts whenever they need to and instantly draw up marketing reports. Our advanced data extraction system has helped the company’s analysts find previously overlooked data and more precisely predict the outcomes of marketing strategies.  We’ve also provided Hyperconnect with more flexibility. Improvado is a scalable system, so whe

In [3]:
# Enter our query here
import torch 
# query = "How many employees in Hyperconnect company?"
query = "Where is located headquater of Hyperconnect company?"
#query = "What projects is Jesse Keeler involved in?"

# Segment our documents
segmented_docs = segment_documents(dataset['train'], 450)

# Retrieve the top k most relevant documents to the query
candidate_docs = get_top_k_articles(query, segmented_docs, 3)

# Return the likeliest answers from each of our top k most relevant documents in descending order
for i in candidate_docs:
    answer_question(query, i)
    print ("Reference Document: ", i)

Answer: "seoul , south korea" with 98.32% confidence
Reference Document:  Hyperconnect aims to empower each person with an ability to connect and keep in touch with others. It’s a mid-sized company with around 400 employees, headquartered in Seoul, South Korea. As a global social platform, the company provides video and AI-powered software that helps users communicate in real time.
Hyperconnect was dealing with fragmented data across different marketing channels. To extract this data, the company allocated developers to write lines of code that acted as marketing connectors. This process took too much time and led to  excessive use of the company’s resources. Furthermore, the gathered data was not normalized, which prevented marketing analysts from making informed decisions. Eventually, Hyperconnect contacted Improvado to optimize the data extraction process.
Answer: "" with 14.76% confidence
Reference Document:  Our solution helped Hyperconnect improve their marketing performance indi

## Evaluation

A good amount of improvements needs to be done.
1) make a dataset with questions and answers to each questions
2) add metrics (F1, Recall, prediction) so it would be easier to visualize
3) Add tuning or maybe use the trained model to apply the model to test dataset
4) Use other models such as GPT2, GPT3 and so on.
5) Build interface around this system. With website link and question as an input and Answer and paragraph as an output.


P.S. I really enjoyed this, thank you! 

# Drafts

## Draft 1

In [4]:
import torch
from datasets import load_dataset
from ast import literal_eval
datafiles = {"train": "datatrain.csv", "test": "datatest.csv"}
dataset = load_dataset("data", data_files = datafiles, delimiter=";", encoding='cp1252')

Found cached dataset csv (C:/Users/Aigerim/.cache/huggingface/datasets/csv/data-726cb969e4b012c5/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


  0%|          | 0/2 [00:00<?, ?it/s]

In [5]:
docs = dataset['train'][0]
for i in range(len(dataset['train'])):
    for doc in dataset['train'][i].values():
        print (doc)


Marketing data organization is a huge problem when promoting your product across different channels and countries and managing multiple campaigns. Our client, Hyperconnect, faced this problem and asked us to find a solution.
Hyperconnect aims to empower each person with an ability to connect and keep in touch with others. It’s a mid-sized company with around 400 employees, headquartered in Seoul, South Korea. As a global social platform, the company provides video and AI-powered software that helps users communicate in real time.
Hyperconnect was dealing with fragmented data across different marketing channels. To extract this data, the company allocated developers to write lines of code that acted as marketing connectors. This process took too much time and led to  excessive use of the company’s resources. Furthermore, the gathered data was not normalized, which prevented marketing analysts from making informed decisions. Eventually, Hyperconnect contacted Improvado to optimize the d

In [6]:
def segment_documents(docs, max_doc_length=450):
    # List containing full and segmented docs
    segmented_docs = []
    for i in range(len(dataset['train'])):
        for doc in dataset['train'][i].values():
#         for doc in docs:
        # Split document by spaces to obtain a word count that roughly approximates the token count
            split_to_words = doc.split(" ")

            # If the document is longer than our maximum length, split it up into smaller segments and add them to the list 
            if len(split_to_words) > max_doc_length:
                for doc_segment in range(0, len(split_to_words), max_doc_length):
                    segmented_docs.append( " ".join(split_to_words[doc_segment:doc_segment + max_doc_length]))

            # If the document is shorter than our maximum length, add it to the list
            else:
                segmented_docs.append(doc)

    return segmented_docs

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_top_k_articles(query, docs, k=2):

    # Initialize a vectorizer that removes English stop words
    vectorizer = TfidfVectorizer(analyzer="word", stop_words='english')

    # Create a corpus of query and documents and convert to TFIDF vectors
    query_and_docs = [query] + docs
    matrix = vectorizer.fit_transform(query_and_docs)

    # Holds our cosine similarity scores
    scores = []

    # The first vector is our query text, so compute the similarity of our query against all document vectors
    for i in range(1, len(query_and_docs)):
        scores.append(cosine_similarity(matrix[0], matrix[i])[0][0])

    # Sort list of scores and return the top k highest scoring documents
    sorted_list = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    top_doc_indices = [x[0] for x in sorted_list[:k]]
    top_docs = [docs[x] for x in top_doc_indices]
  
    return top_docs

In [8]:
from transformers.models.bert import BertTokenizer, BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

In [11]:

def answer_question(question, answer_text):

	input_ids = tokenizer.encode(question, answer_text, max_length=512)
	
	# ======== Set Segment IDs ========
	# Search the input_ids for the first instance of the `[SEP]` token.
	sep_index = input_ids.index(tokenizer.sep_token_id)

	# The number of segment A tokens includes the [SEP] token istelf.
	num_seg_a = sep_index + 1

	# The remainder are segment B.
	num_seg_b = len(input_ids) - num_seg_a

	# Construct the list of 0s and 1s.
	segment_ids = [0]*num_seg_a + [1]*num_seg_b

	# There should be a segment_id for every input token.
	assert len(segment_ids) == len(input_ids)

	outputs = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]), return_dict=True) 

	start_scores = outputs.start_logits
	end_scores = outputs.end_logits

    # ======== Reconstruct Answer ========
	# Find the tokens with the highest `start` and `end` scores.
	answer_start = torch.argmax(start_scores)
	answer_end = torch.argmax(end_scores)

        
	# Get the string versions of the input tokens.
	tokens = tokenizer.convert_ids_to_tokens(input_ids)

	# Start with the first token.
	answer = tokens[answer_start]

	# Select the remaining answer tokens and join them with whitespace.
	for i in range(answer_start + 1, answer_end + 1):
		
		# If it's a subword token, then recombine it with the previous token.
		if tokens[i][0:2] == '##':
			answer += tokens[i][2:]
		
		# Otherwise, add a space then the token.
		else:
			answer += ' ' + tokens[i]

	print('Answer: "' + answer + '" with ' + f'{answer_end}' + '% confidence' )

In [12]:
# Enter our query here
import torch 
query = "How many employees in Hyperconnect company?"
#query = "What else does the bassist for Death From Above play?"
#query = "What projects is Jesse Keeler involved in?"

# Segment our documents
segmented_docs = segment_documents(dataset['train'], 450)

# Retrieve the top k most relevant documents to the query
candidate_docs = get_top_k_articles(query, segmented_docs, 3)

# Return the likeliest answers from each of our top k most relevant documents in descending order
for i in candidate_docs:
    answer_question(query, i)
    print ("Reference Document: ", i)
    
print("------------------------------------")

Answer: "employees across the company" with 84% confidence
Reference Document:  Our solution helped Hyperconnect improve their marketing performance indicators and reduce employees' time on routine data operations. Now, our client has access to more granular data and makes informed decisions faster. According to Hyperconnect, their marketing team now concentrates on marketing campaign optimization rather than data integration tasks. Furthermore, our ETL system provided employees across the company with easy access to all marketing reports. Sales managers, stakeholders, business analysts, and other competent employees can now analyze the company’s marketing efforts whenever they need to and instantly draw up marketing reports. Our advanced data extraction system has helped the company’s analysts find previously overlooked data and more precisely predict the outcomes of marketing strategies.  We’ve also provided Hyperconnect with more flexibility. Improvado is a scalable system, so when 

In [13]:
# Enter our query here
import torch 
# query = "How many employees in Hyperconnect company?"
query = "Where is located headquater of Hyperconnect company?"
#query = "What projects is Jesse Keeler involved in?"

# Segment our documents
segmented_docs = segment_documents(dataset['train'], 450)

# Retrieve the top k most relevant documents to the query
candidate_docs = get_top_k_articles(query, segmented_docs, 3)

# Return the likeliest answers from each of our top k most relevant documents in descending order
for i in candidate_docs:
    answer_question(query, i)
    print ("Reference Document: ", i)

Answer: "seoul , south korea" with 55% confidence
Reference Document:  Hyperconnect aims to empower each person with an ability to connect and keep in touch with others. It’s a mid-sized company with around 400 employees, headquartered in Seoul, South Korea. As a global social platform, the company provides video and AI-powered software that helps users communicate in real time.
Hyperconnect was dealing with fragmented data across different marketing channels. To extract this data, the company allocated developers to write lines of code that acted as marketing connectors. This process took too much time and led to  excessive use of the company’s resources. Furthermore, the gathered data was not normalized, which prevented marketing analysts from making informed decisions. Eventually, Hyperconnect contacted Improvado to optimize the data extraction process.
Answer: "[SEP]" with 14% confidence
Reference Document:  Our solution helped Hyperconnect improve their marketing performance indic