- Author: Elian Freyermuth
- Date: 10/12/2023

Summary:  
First, thank you for this technical test. It gave me the chance to try several methods in search of good results. I first tried the token-based Jaccard distance to measure the similarity between the questions and the articles, but it did not give good results. I then decided to use a custom tokenizer and embedding layer to process the data: the goal was to embed each article and, for each question, compare the embedding distances (Minkowski distance or cosine similarity). This approach worked like a charm in two previous projects, but here it struggled a lot and the questions were not matched well to the articles.
<br/><br/>
As a consequence, I tried larger pre-trained models such as bert-base-uncased, xlm-roberta-base and xlm-v to get the embeddings; however, they do not seem to give proper results either.  
I decided to stop here after spending a long time trying to understand what the problem could be. I would be glad to receive any advice about this test and my methods.
<br/><br/>

Thank you again,  
Best regards,  
Elian Freyermuth
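
As a minimal sketch of the comparison step described in the summary (embed each article, then rank the articles by their distance to the question embedding), the snippet below uses small made-up vectors. The function names and the toy vectors are illustrative assumptions, not the notebook's actual data or helpers.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (higher means more similar)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def minkowski_distance(a: np.ndarray, b: np.ndarray, p: float = 2.0) -> float:
    """Minkowski distance of order p (p=2 is the Euclidean distance; lower means closer)."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Hypothetical 3-dimensional embeddings for three articles and one question.
article_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
])
question_embedding = np.array([0.5, 0.9, 0.0])

cosine_scores = [cosine_similarity(question_embedding, e) for e in article_embeddings]
minkowski_scores = [minkowski_distance(question_embedding, e) for e in article_embeddings]

# Cosine similarity is maximised, a distance is minimised.
best_by_cosine = int(np.argmax(cosine_scores))
best_by_minkowski = int(np.argmin(minkowski_scores))
print(best_by_cosine, best_by_minkowski)
```

Both measures agree on this toy data, but they can disagree when the embeddings have very different norms, since cosine similarity ignores vector length.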

In [None]:
import pandas as pd
from bs4 import BeautifulSoup
!pip install textdistance
import textdistance as td
import torch

from tqdm.auto import tqdm
tqdm.pandas()



## Dataset reading and cleaning

In [None]:
questions_df = pd.read_csv("./simulated_questions.csv")
questions_df

Unnamed: 0,question
0,How to make an explosive in FC5?
1,Are there any news on when AC Jade will be ava...
2,Is it possible to continue my progression if I...
3,Will Starlink run on a GPU with 2GB RAM? Is th...
4,How to report a toxic playing behavior in Mari...
5,I have an error BE8A522E. What does it mean?
6,"I wanted to play HoMM IV after a long break, b..."
7,How to sign up for Just Dance+?
8,I can't find the Discovery Tour in ACV despite...
9,Comment puis-je informer les développeurs d'Ub...


In [None]:
faq_articles_df = pd.read_csv("./FAQ_articles.csv")
faq_articles_df

Unnamed: 0,articleID,title,html_content,URL
0,74416,Contents of the Assassin's Creed: Syndicate Se...,<div>The Season Pass offers access to addition...,https://www.ubisoft.com/help/game/article/titl...
1,97885,Linking your YouTube and Ubisoft accounts,"<div>To <strong>link</strong> your <a href=""ht...",https://www.ubisoft.com/help/game/article/titl...
2,80266,Unlocking Ubisoft Connect Rewards for legacy g...,<div>Some older Ubisoft titles no longer recei...,https://www.ubisoft.com/help/game/article/titl...
3,80333,Error code 17008 in Ubisoft Connect,"If you receive this error, there are a few wor...",https://www.ubisoft.com/help/game/article/titl...
4,105788,Decommissioning of online services for older U...,<div>Decommissioning the online services for <...,https://www.ubisoft.com/help/game/article/titl...
...,...,...,...,...
95,94241,Upgrading Rainbow Six: Siege from PlayStation ...,<div>\n<div>\n<div>\n<div>In order to <u><a hr...,https://www.ubisoft.com/help/game/article/titl...
96,96236,Checking your NAT type on PC,"<div>The <a href=""https://ubisoft.com/help/art...",https://www.ubisoft.com/help/game/article/titl...
97,97861,Carrying over your Rocksmith+ Closed Beta prog...,The <em>Rocksmith+</em> Closed Beta is a work ...,https://www.ubisoft.com/help/game/article/titl...
98,65237,Language options in Assassin's Creed: The Rebe...,<div>Below is a list of supported languages fo...,https://www.ubisoft.com/help/game/article/titl...


In [None]:
def extract_text_from_html(html:str)->str:
    # Parse the HTML and keep only the visible text
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text().strip()

faq_articles_df["html_content"] = faq_articles_df["html_content"].map(extract_text_from_html) #no need to keep original html

# Jaccard distance method

In [None]:
question_test = questions_df.iloc[0]["question"]
print(f"Question: \"{question_test}\"")
distance = lambda html_text_content: td.jaccard(question_test.split(), html_text_content.split())

tmp_df = pd.DataFrame(faq_articles_df["html_content"].copy(deep=True))
tmp_df["score"] = tmp_df["html_content"].map(distance)
tmp_df.sort_values(by="score", ascending=False, inplace=True)
tmp_df.to_csv("./tmp.csv")
print(tmp_df)

Question: "How to make an explosive in FC5?"
                                         html_content     score
60  To delete an Autodance:\n• Highlight the Just ...  0.065217
35  At Ubisoft we are committed to providing an in...  0.065217
13  At Ubisoft we are committed to providing an in...  0.060000
97  The Rocksmith+ Closed Beta is a work in progre...  0.050000
84  This error indicates that you have been unable...  0.050000
..                                                ...       ...
48  セーブデータはクラウドサーバーに保存されます。Nintendo Switch本体／SDカード...  0.000000
91  To use photo mode:\n•  Open the in-game menu.\...  0.000000
29  Below is a list of the supported languages for...  0.000000
2   Some older Ubisoft titles no longer receive pa...  0.000000
98  Below is a list of supported languages for Ass...  0.000000

[100 rows x 2 columns]


To explain my research path: I started with an unsupervised method based on a string similarity distance, as it was the easiest to set up.
Unfortunately, as seen above, the results are not good and do not lead to the correct answers.
This led me to try embeddings and nearest neighbours.
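
To illustrate how a token-based Jaccard score behaves, here is a minimal set-based sketch. It is an assumption-laden simplification: `textdistance.jaccard` may treat repeated tokens differently, and the example texts are made up. One plausible reason for the low scores above is that a long article adds many tokens to the union, which dilutes the score even when the relevant words are shared.

```python
def jaccard_similarity(tokens_a, tokens_b) -> float:
    """Set-based Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

question = "How to make an explosive in FC5?".lower().split()

# Hypothetical short and long texts sharing the same four tokens with the question.
short_text = "how to craft an explosive".split()
long_text = short_text + [f"filler{i}" for i in range(45)]

short_score = jaccard_similarity(question, short_text)
long_score = jaccard_similarity(question, long_text)
print(f"short: {short_score:.4f}, long: {long_score:.4f}")
```

Note also that "fc5?" keeps its question mark after a plain `split()`, so it would not match a token "fc5" in an article.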

# Custom Tokenizer and Embedding Layer

In [None]:
class CustomTokenizer:
    def __init__(self):
        self.vocab = {}

    @staticmethod
    def text_cleaner(text:str)->str:
        # Keep only alphanumeric characters and spaces
        is_alnum_or_space = lambda c: c.isalnum() or c == " "
        return ''.join(filter(is_alnum_or_space, text))

    @property
    def vocab_size(self):
        return len(self.vocab)

    def tokenize(self, text):
        tokens = CustomTokenizer.text_cleaner(text).lower().split()  # Strip punctuation, lowercase, split on whitespace
        return tokens

    def __call__(self, text:str): return self.sentence_to_ids(text)

    def add_vocab_from_sentence(self, sentence):
        tokens = self.tokenize(sentence)
        for token in tokens:
            if token not in self.vocab:
                self.vocab[token] = len(self.vocab)

    def sentence_to_ids(self, sentence):
        return self.__tokens_to_ids(self.tokenize(sentence))

    def __tokens_to_ids(self, tokens):
        return [self.vocab[token] for token in tokens if token in self.vocab]


In [None]:
custom_tokenizer = CustomTokenizer()
questions_df["question"].map(custom_tokenizer.add_vocab_from_sentence)
faq_articles_df["html_content"].map(custom_tokenizer.add_vocab_from_sentence)
print("Vocabulary size:", custom_tokenizer.vocab_size)

Vocabulary size: 2166


In [None]:
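# Use the custom tokenizer's vocabulary size (`tokenizer` is only defined later, for the pre-trained model)
embedding = torch.nn.Embedding(custom_tokenizer.vocab_size, 1024)
# Note: this layer is randomly initialised and is never trained in this notebook
def embed(text:str):
    tokenized = custom_tokenizer(text)
    if len(tokenized)==0:
        return None
    # Mean of the token embeddings gives one vector per text
    text_embedded = embedding(torch.tensor(tokenized)).mean(dim=0).cpu().detach().numpy()
    return text_embedded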

In [None]:
faq_articles_df["content_embedding"] = faq_articles_df["html_content"].progress_apply(embed)
faq_articles_df.dropna(subset=["content_embedding"], inplace=True)
faq_articles_df["title_embedding"] = faq_articles_df["title"].progress_apply(embed)
faq_articles_df.dropna(subset=["title_embedding"], inplace=True)
questions_df["embedding"] = questions_df["question"].progress_apply(embed)

100%|██████████| 100/100 [00:00<00:00, 704.76it/s]
100%|██████████| 100/100 [00:00<00:00, 16508.46it/s]
100%|██████████| 28/28 [00:00<00:00, 3192.62it/s]


In [None]:
question_test_id = 1
question_test = questions_df.iloc[question_test_id]
print(f"Question: \"{question_test['question']}\"")

custom_embedding = embedding(torch.tensor(custom_tokenizer(question_test["question"]))).mean(dim=0).cpu().detach()

cos_sim = torch.nn.CosineSimilarity(dim=0)
def distance(html_text:str):
    with torch.no_grad():
        tokenized = custom_tokenizer(html_text)
        if len(tokenized)==0:
            return -1  # no known token: lowest possible cosine similarity
        text_embedded = embedding(torch.tensor(tokenized))
        similarity = cos_sim(text_embedded.mean(dim=0), custom_embedding).item()
    return similarity

def string_similarity(serie: pd.Series):
    _df_similarity_score = serie.progress_apply(distance)
    # _df_similarity_score.dropna(inplace=True)
    # soft_max = torch.nn.Softmax(dim=0)
    # _df_similarity_score = pd.Series(soft_max(torch.tensor(_df_similarity_score.values)))
    return _df_similarity_score

tmp_df = pd.DataFrame(faq_articles_df[["html_content", "title"]].copy(deep=True))
# tmp_df["score_html_text"] = string_similarity(tmp_df["html_text_content"])
tmp_df["score_title"] = string_similarity("title: " + tmp_df["title"] + "\n content: " + tmp_df["html_content"])

tmp_df.sort_values(by="score_title", ascending=False, inplace=True)
print(tmp_df)
tmp_df.to_csv("./tmp.csv")

Question: "Are there any news on when AC Jade will be available?"


  0%|          | 0/100 [00:00<?, ?it/s]

                                         html_content  \
70  When DedSec operatives are critically injured,...   
8   Following the shutdown of Stadia on 18 January...   
19  This error occurs when Ubisoft Connect PC is u...   
4   Decommissioning the online services for older ...   
57  As you progress through Sequence 3, Memory 4 (...   
..                                                ...   
90  In Roller Champions, you can customize your av...   
25  Back in 2017, For Honor made its debut and int...   
30  In Far Cry 6 you can change the language for t...   
48  セーブデータはクラウドサーバーに保存されます。Nintendo Switch本体／SDカード...   
60  To delete an Autodance:\n• Highlight the Just ...   

                                                title  score_title  
70   Healing injured operatives in Watch Dogs: Legion     0.246113  
8    Keeping your Stadia save files for Ubisoft games     0.241367  
19  Error message "A Ubisoft service is not availa...     0.196130  
4   Decommissioning of online services 

In [None]:
from sklearn.neighbors import NearestNeighbors
import numpy as np

knn_title = NearestNeighbors(n_neighbors=5, metric="cosine")
data = faq_articles_df["title_embedding"].values
data = np.vstack(data)
knn_title.fit(data)

In [None]:
knn_content = NearestNeighbors(n_neighbors=5, metric="cosine")
data = faq_articles_df["content_embedding"].values
data = np.vstack(data)
knn_content.fit(data)

In [None]:
knn_content_and_title = NearestNeighbors(n_neighbors=5, metric="cosine")
def row_mean(row):
    data = np.vstack((row["content_embedding"], row["title_embedding"]))
    return np.average(data, axis=0, weights=[1./5, 4./5])

data = faq_articles_df.progress_apply(row_mean, axis=1).values
data = np.vstack(data)
print(data.shape, data)
knn_content_and_title.fit(data)

100%|██████████| 99/99 [00:00<00:00, 9857.70it/s]

(99, 1024) [[ 0.70666069  0.01596309 -0.03587186 ...  0.02388256  0.0124868
  -0.0896798 ]
 [-0.05860371 -0.28635466  0.3071129  ... -0.07745139  0.38517178
   0.2565737 ]
 [-0.64907342  0.29745751  0.41481947 ... -0.42120284  0.15131914
   0.19292997]
 ...
 [ 0.33056883 -0.54828329 -0.26837288 ... -0.09260171  0.80024089
  -0.80291674]
 [ 0.24524291  0.0338645  -0.09764073 ...  0.88006246  0.20052549
   0.04589768]
 [ 0.50445642  0.30209381  0.37284029 ... -0.27244188 -0.50273023
   0.03406248]]





In [None]:
question_test = questions_df.iloc[4]
print(question_test)
distances_title, indices_title = knn_content_and_title.kneighbors(question_test["embedding"].reshape(1, -1), n_neighbors=20)
print("Distances:", distances_title)
print(faq_articles_df.iloc[indices_title[0]])

question     How to report a toxic playing behavior in Mari...
embedding    [0.3089377, -0.56294197, 0.59237, 0.36797246, ...
Name: 4, dtype: object
Distances: [[0.6288601  0.68315991 0.71372585 0.73034581 0.74897224 0.76880563
  0.77773899 0.77875956 0.78872011 0.81164458 0.82560202 0.83098521
  0.83126708 0.83889213 0.84033573 0.84306354 0.84382378 0.84564189
  0.84878564 0.85453432]]
    articleID                                              title  \
14      64374  Accessing Challenges in Mario + Rabbids Kingdo...   
55      64832         Configuring sound levels in Rabbids Coding   
63      64847  Troubleshooting performance issues in Rabbids ...   
66      62974  Locating in-game content for Mario + Rabbids K...   
5       99331                   Reporting a bug in Ubisoft games   
52     104647                    Cross-progression in Trackmania   
13     103219  Code of Conduct for Mario + Rabbids Sparks of ...   
51      76017                 Activating Deep Ocean in Anno 2070  

In [None]:
question_test = questions_df.iloc[0]
print(question_test)
distances_title, indices_title = knn_title.kneighbors(question_test["embedding"].reshape(1, -1), n_neighbors=20)
print("Distances:", distances_title)
print(faq_articles_df.iloc[indices_title[0]])

question                      How to make an explosive in FC5?
embedding    [-0.49198765, -0.3451201, 0.43429187, -0.08201...
Name: 0, dtype: object
Distances: [[0.7652691  0.7693465  0.77511686 0.8015623  0.80417955 0.8079766
  0.81531423 0.8159057  0.819003   0.82034844 0.82349557 0.82726014
  0.82995915 0.8345226  0.83769727 0.8382895  0.83904624 0.8414052
  0.8468493  0.849199  ]]
    articleID                                             title  \
59      99744              Language options in Monopoly Madness   
52     104647                   Cross-progression in Trackmania   
37      61584                   Hiding games in Ubisoft Connect   
30      98998            Changing language options in Far Cry 6   
81      99058       Cross-platform content sharing in Far Cry 6   
90     101948   Customising your appearance in Roller Champions   
60      77221     Deleting Autodance videos in Just Dance games   
16      80983                       Cross-play in Trials Rising   
78      6

In [None]:
distances_title, indices_title = knn_content.kneighbors(question_test["embedding"].reshape(1, -1), n_neighbors=20)
print("Distances:", distances_title)
print(faq_articles_df.iloc[indices_title[0]])

Distances: [[0.7254748  0.7410608  0.74405205 0.7606025  0.76504713 0.7806391
  0.78327376 0.8004626  0.8050662  0.8061143  0.8147876  0.8157264
  0.8247873  0.8261517  0.82665867 0.8341482  0.8348928  0.8367113
  0.83753014 0.8413661 ]]
    articleID                                              title  \
53      97180              Manually updating older Ubisoft games   
99      60487        Online Multiplayer Settings in Watch Dogs 2   
92     102629  Unlocking outfits in Niflheim for Assassin's C...   
11      65295  White circles on the ground in Discovery Tour:...   
44      62780  Locating the quest "Righting a Wrong" (Lost Ta...   
54      79352  Silvers and Helix Credits in Assassin's Creed ...   
38      61881       Configuring sound levels in Far Cry New Dawn   
35     101322        Code of Conduct for Trivial Pursuit Live! 2   
60      77221      Deleting Autodance videos in Just Dance games   
76      60916            Contents of the Anno 1800 Season 1 Pass   
9       62697 

# Pre-trained model based

In [None]:
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "facebook/xlm-v-base" #"xlm-roberta-base"  # or any other suitable pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


Some weights of XLMRobertaModel were not initialized from the model checkpoint at facebook/xlm-v-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
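def get_embeddings(text):
    # Set max_length explicitly: this tokenizer has no predefined maximum length,
    # so truncation=True alone does not truncate. 512 is assumed here as the
    # usual limit for XLM-R-style models.
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = torch.mean(outputs.last_hidden_state, dim=1)  # Mean pooling over tokens
    return embeddings.cpu().detach().numpy()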

faq_articles_df["content_embedding"] = faq_articles_df["html_content"].progress_apply(get_embeddings)
faq_articles_df.dropna(subset=["content_embedding"], inplace=True)
faq_articles_df["title_embedding"] = faq_articles_df["title"].progress_apply(get_embeddings)
faq_articles_df.dropna(subset=["title_embedding"], inplace=True)
questions_df["embedding"] = questions_df["question"].progress_apply(get_embeddings)

  0%|          | 0/100 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/28 [00:00<?, ?it/s]

In [None]:
faq_articles_df["title_content"] = "title: " + faq_articles_df["title"] + "\n content: " + faq_articles_df["html_content"]
faq_articles_df["title_content_embedding"] = faq_articles_df["title_content"].progress_apply(get_embeddings)

  0%|          | 0/100 [00:00<?, ?it/s]

### KNN

In [None]:
from sklearn.neighbors import NearestNeighbors
import numpy as np

knn_title = NearestNeighbors(n_neighbors=5, metric="cosine")
data = faq_articles_df["title_embedding"].values
data = np.vstack(data)
knn_title.fit(data)

knn_content = NearestNeighbors(n_neighbors=5, metric="cosine")
data = faq_articles_df["content_embedding"].values
data = np.vstack(data)
knn_content.fit(data)

knn_titlecontent = NearestNeighbors(n_neighbors=1, metric="cosine")
data = faq_articles_df["title_content_embedding"].values
data = np.vstack(data)
print(data)
knn_titlecontent.fit(data)

[[ 0.00024993  0.005599    0.01671557 ...  0.02056915 -0.00028062
   0.1087667 ]
 [ 0.02159384  0.01569649  0.02169064 ...  0.12161361 -0.02437468
   0.08214884]
 [ 0.02215911  0.00654834  0.00385957 ...  0.07124063  0.01232011
   0.12010536]
 ...
 [-0.02619703 -0.01498317  0.01656552 ...  0.07921132  0.0367736
   0.06971673]
 [ 0.0090782   0.02640423 -0.00738526 ...  0.1068259   0.0082129
   0.12073851]
 [ 0.02965323  0.00925691  0.00635692 ...  0.07847287  0.00447676
   0.08610021]]


In [None]:
question_test = questions_df.iloc[0]
print(question_test.question)
distances_title, indices_title = knn_titlecontent.kneighbors(question_test["embedding"].reshape(1, -1), n_neighbors=1)
print("Distances:", distances_title)

def display(row):
    print("==========================")
    print("title_content:", row["title_content"])
    print("==========================\n")

faq_articles_df.iloc[indices_title[0]].apply(display, axis=1)
print()

How to make an explosive in FC5?
Distances: [[0.0064888]]
title_content: title: Error Code Bookworm-BE8A522E in Far Cry 6
 contentThis message occurs when there has been an issue accepting a co-op session invite from another player.
You can resolve this by restarting the game.




In [None]:
def similarity_search(question_embedding):
    # Use the embedding passed in. Reading the global `question_test` here
    # returns the same article for every question.
    _, indices_title = knn_titlecontent.kneighbors(question_embedding.reshape(1, -1), n_neighbors=1)
    return faq_articles_df.iloc[indices_title[0][0]]["articleID"]
questions_df["article_id"] = questions_df["embedding"].progress_apply(similarity_search)

  0%|          | 0/28 [00:00<?, ?it/s]

In [None]:
questions_df

Unnamed: 0,question,embedding,article_id
0,How to make an explosive in FC5?,"[[-0.01766024, 0.040068943, 0.011997369, 0.007...",99148
1,Are there any news on when AC Jade will be ava...,"[[-0.042750146, 0.0049722022, 0.01994016, 0.01...",99148
2,Is it possible to continue my progression if I...,"[[-0.01710297, 0.02514645, 0.0048631486, -0.02...",99148
3,Will Starlink run on a GPU with 2GB RAM? Is th...,"[[-0.027429102, -0.002572644, 0.040274885, 0.0...",99148
4,How to report a toxic playing behavior in Mari...,"[[0.03184674, 0.011886748, 0.031098751, 0.0179...",99148
5,I have an error BE8A522E. What does it mean?,"[[0.011777533, 0.05218943, 0.0007697771, -0.00...",99148
6,"I wanted to play HoMM IV after a long break, b...","[[-0.013195799, -0.009787414, -0.009858473, -0...",99148
7,How to sign up for Just Dance+?,"[[0.03481711, 0.041446667, 0.022202183, 0.0385...",99148
8,I can't find the Discovery Tour in ACV despite...,"[[-0.03882204, 0.011960899, -0.0033591096, 0.0...",99148
9,Comment puis-je informer les développeurs d'Ub...,"[[-0.015457971, 0.018980213, 0.027896011, 0.00...",99148


### Manual Cosine method

In [None]:
# m = torch.nn.CosineSimilarity(dim=1)
from sklearn.metrics.pairwise import cosine_similarity


def similarity_search(question_embedding):
    data = faq_articles_df[["title_content", "title_embedding", "content_embedding", "title_content_embedding", "articleID"]].copy(deep=True)
    # Cosine alternatives that were also tried:
    # data["score"] = data["title_embedding"].apply(lambda emb: m(torch.tensor(question_embedding), torch.tensor(emb)).item())
    # data["score"] = data["title_embedding"].apply(lambda emb: cosine_similarity(question_embedding, emb)[0,0])
    # torch.cdist returns the Euclidean (p=2) distance, so lower scores are better
    data["score"] = data["title_content_embedding"].apply(lambda emb: torch.cdist(torch.tensor(question_embedding), torch.tensor(emb)).item())
    data.sort_values(by=["score"], inplace=True, ascending=True)
    print(data)
    return data.iloc[0:5]

question_idx = 10  # avoids shadowing the built-in `id`
print("question:", questions_df.iloc[question_idx].question)
result = questions_df.iloc[question_idx:question_idx+1]["embedding"].progress_apply(similarity_search)
print(result.iloc[0].title_content)

question: Est-il possible d'obtenir un Alpha Pack gratuitement dans R6S ?


  0%|          | 0/1 [00:00<?, ?it/s]

                                        title_content  \
9   title: Location of Secrets of the First Pyrami...   
94  title: Ubisoft Store - Issue exchanging Ubisof...   
39  title: Contents of Far Cry 3 Classic Edition (...   
16  title: Cross-play in Trials Rising\n content: ...   
43  title: Accessing the quest Test of Judgment in...   
..                                                ...   
58  title: System requirements for Rocksmith+ (PC)...   
74  title: System requirements for Assassin's Cree...   
32  title: System requirements for Watch Dogs 2\n ...   
62  title: Troubleshooting technical issues on Pla...   
77  title: System requirements for Assassin's Cree...   

                                      title_embedding  \
9   [[0.0005317729, -0.00161402, 0.008354791, -0.0...   
94  [[0.0026955556, -0.016921623, 0.010726158, -0....   
39  [[0.007894096, -0.009946631, 0.0016931917, 0.0...   
16  [[0.0049432893, -0.0030729354, 0.006644571, 0....   
43  [[0.0028395262, -0.0012925

In [None]:
m = torch.nn.CosineSimilarity(dim=1)


def similarity_search(question_embedding):
    data = faq_articles_df[["title", "title_content_embedding", "articleID"]].copy(deep=True)
    data["score"] = data["title_content_embedding"].apply(lambda emb: m(torch.tensor(question_embedding), torch.tensor(emb)).item())
    data.sort_values(by=["score"], inplace=True, ascending=False)
    return data.iloc[0].articleID, data.iloc[0].title

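# Each result is an (article_id, article_title) tuple; split it into two columns.
# Indexing with ["article_id", "article_title"] would create one tuple-named column instead.
matches = questions_df["embedding"].progress_apply(similarity_search)
questions_df["article_id"] = matches.map(lambda match: match[0])
questions_df["article_title"] = matches.map(lambda match: match[1])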

  0%|          | 0/28 [00:00<?, ?it/s]

In [None]:
questions_df

Unnamed: 0,question,embedding,article_id,"(article_id, article_title)"
0,How to make an explosive in FC5?,"[[0.005815069, -0.0032068184, 0.009311468, 0.0...",62697,"(62697, Location of Secrets of the First Pyram..."
1,Are there any news on when AC Jade will be ava...,"[[0.0056408206, -0.008635589, 0.0058236443, 0....",105862,"(105862, Ubisoft Store - Issue exchanging Ubis..."
2,Is it possible to continue my progression if I...,"[[0.004945445, 0.0048120166, 0.011660387, -0.0...",102641,"(102641, Niflheim currencies in Assassin's Cre..."
3,Will Starlink run on a GPU with 2GB RAM? Is th...,"[[0.0070026894, -0.0045886934, 0.012220346, -0...",99148,"(99148, Error Code Bookworm-BE8A522E in Far Cr..."
4,How to report a toxic playing behavior in Mari...,"[[0.001433985, 0.002091226, 0.00673151, 0.0043...",60916,"(60916, Contents of the Anno 1800 Season 1 Pass)"
5,I have an error BE8A522E. What does it mean?,"[[0.012558115, -0.0032170757, 0.00033806168, -...",99148,"(99148, Error Code Bookworm-BE8A522E in Far Cr..."
6,"I wanted to play HoMM IV after a long break, b...","[[0.0006757137, -0.010852469, 0.009820812, -0....",61934,"(61934, Equinox upgrades)"
7,How to sign up for Just Dance+?,"[[-0.0025944444, -0.0012060399, 0.009584829, -...",62697,"(62697, Location of Secrets of the First Pyram..."
8,I can't find the Discovery Tour in ACV despite...,"[[0.0034633633, -0.010108132, 0.017663503, -0....",62780,"(62780, Locating the quest ""Righting a Wrong"" ..."
9,Comment puis-je informer les développeurs d'Ub...,"[[0.014862637, -0.0033176357, 0.0042878105, -0...",62697,"(62697, Location of Secrets of the First Pyram..."


In [None]:
questions_df.to_csv("./output.csv", columns=["question", "article_id"])

Unfortunately, the results given by this method are very disappointing. I spent a lot of time trying to understand what the issue is and decided to stop after searching for 5 hours.

I have tried the different methods I know but have not found a way to fix this. I have used this method successfully in the past without any issue, but here it does not work as expected.
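
One possible check, offered as a hedged sketch rather than a confirmed diagnosis: mean-pooled embeddings from a model that was not fine-tuned for sentence similarity can all point in nearly the same direction, which would make cosine rankings close to arbitrary. The very small nearest-neighbour cosine distance printed earlier (0.0064888) is consistent with this, although it does not prove it. The snippet below measures how spread out the pairwise cosine similarities are, before and after subtracting the mean vector. It runs on synthetic data; `fake_embeddings` and the helper names are hypothetical, and in the notebook the input would instead be the stacked article embeddings.

```python
import numpy as np

def cosine_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings`."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

def off_diagonal_stats(sim: np.ndarray) -> tuple:
    """Mean and standard deviation of the similarities between distinct rows."""
    mask = ~np.eye(sim.shape[0], dtype=bool)
    values = sim[mask]
    return float(values.mean()), float(values.std())

# Synthetic stand-in for the article embeddings (hypothetical data):
# a large offset shared by every row plus a small row-specific signal.
rng = np.random.default_rng(0)
shared_offset = rng.normal(size=64) * 5.0
fake_embeddings = shared_offset + rng.normal(size=(100, 64))

raw_mean, raw_std = off_diagonal_stats(cosine_matrix(fake_embeddings))

# Subtracting the mean vector removes the shared direction before comparing.
centered = fake_embeddings - fake_embeddings.mean(axis=0, keepdims=True)
centered_mean, centered_std = off_diagonal_stats(cosine_matrix(centered))

print(f"raw:      mean={raw_mean:.3f} std={raw_std:.3f}")
print(f"centered: mean={centered_mean:.3f} std={centered_std:.3f}")
```

If the real embeddings show a raw mean close to 1 with a tiny standard deviation, the rankings are being decided by very small differences, and centering (or a model trained specifically for sentence similarity) may be worth trying.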