Text analysis
https://www.sbert.net/docs/usage/semantic_textual_similarity.html
Given the same information need, how much does the query text differ across users?

How does the query text evolve over the course of a conversation?

Yes, you can compute the similarity between queries with SBERT or another sentence-similarity method.
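As a rough illustration of the two questions above, the sketch below scores aligned queries from two users and consecutive queries within one conversation. The helper names (`cosine`, `cross_user_scores`, `drift_scores`, `toy_encode`) and the `encode` parameter are hypothetical and not part of this notebook. `encode` stands for any function that maps a list of sentences to a list of vectors, for example the `encode` method of a `SentenceTransformer` model. The toy word-count encoder is an assumption included only so the example runs without downloading a model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def cross_user_scores(queries_a, queries_b, encode):
    """Compare the k-th query of user A with the k-th query of user B."""
    n = min(len(queries_a), len(queries_b))
    # Encode both users' queries in one call so all vectors share one space
    emb = encode(queries_a[:n] + queries_b[:n])
    return [cosine(emb[k], emb[n + k]) for k in range(n)]

def drift_scores(queries, encode):
    """Compare each query with the next one in the same conversation."""
    emb = encode(queries)
    return [cosine(emb[k], emb[k + 1]) for k in range(len(emb) - 1)]

def toy_encode(sentences):
    """Stand-in encoder (assumption): word-count vectors over a shared vocabulary."""
    tokenised = [s.lower().split() for s in sentences]
    vocab = sorted({w for words in tokenised for w in words})
    return [[words.count(w) for w in vocab] for words in tokenised]
```

With a real model, `encode` could be replaced by something like `model.encode`. A low value in `drift_scores` would then suggest that the user changed the focus of the conversation between two turns.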

# Create one list per JSON file for each topic

Each list contains, in order, the queries that one user sent for this specific topic.

The code loops over every file in the topic1_phone directory and checks whether it is a JSON file. If it is, it loads the JSON data and extracts the text of the messages sent by the specified users (in this case, U04RH7PFJ6Q and YHEJSIM5). It drops the first of those messages, stores the rest in a list with a unique name (list1, list2, etc.), and prints the list name followed by its messages.

In [79]:

import os
import json



# Define the users we're interested in
users = ["U04RH7PFJ6Q", "YHEJSIM5"]

# Define the directory containing the JSON files
directory = '/Users/alejandroospina/Desktop/1. USI/4. IV SEMESTER/1. Thesis/ConversationalThesis/SplitOfConversationbyTopic/textAnalysisByTopic/topic1_phone'

# Define a counter variable to generate unique list names
counter = 1

# Loop over each file in the directory
for filename in os.listdir(directory):
    # Check that the file is a JSON file
    if filename.endswith(".json"):
        # Open the file and load the JSON data
        with open(os.path.join(directory, filename)) as f:
            data = json.load(f)
            
        # Keep only the text of messages sent by the specified users;
        # .get() skips entries that have no "user" field
        messages = [msg["text"] for msg in data if msg.get("user") in users]

        # Drop the first message of the conversation
        messages = messages[1:]

        # Store the list under a unique global name (list1, list2, ...)
        new_list_name = "list{}".format(counter)
        counter += 1
        globals()[new_list_name] = messages

        # Print the new list name, then its messages
        print("New list name:", new_list_name)
        print(messages)


New list name: list1
['I would like to buy a new phone. What are the main aspects that I should consider?', 'My operating system of choice for a new phone is Android. I am looking for a phone with durable battery life, fast performance and a large storage.', 'What are the prices of the phones you described?', 'Based on the previous requirements for battery life, performance, and storage, could you find me Android phones that are below 400 dollars?', 'Out of these options, which one has the largest display and the prettier design?', 'What are their additional features? What does any of those 4 phones have that other phones do not have?', 'Finally, could you give me a top 10 of the best phones, regardless of any price range or operating system.', 'Thank you, I am satisfied with questions regarding phones']
list1
New list name: list2
['Hi again, can you help me choosing a new phone?', 'I would like to buy an iOS phone but which costs not more than 1200 euros.', 'What about iPhone 14?', 'D
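Creating variables through `globals()` works in a notebook, but the lists are hard to iterate over afterwards. A minimal alternative sketch, assuming each file holds a JSON array of message objects with "user" and "text" fields as in the cell above, collects the queries in a dictionary keyed by file name. The function name `load_user_queries` and the `skip_first` parameter are hypothetical.

```python
import json
import os

def load_user_queries(directory, users, skip_first=True):
    """Return {filename: [query text, ...]} for every JSON file in `directory`."""
    queries_by_file = {}
    # Sort the file names so the result does not depend on the OS listing order
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".json"):
            continue
        with open(os.path.join(directory, filename), encoding="utf-8") as f:
            data = json.load(f)
        # Keep the text of messages sent by the selected users only
        messages = [
            m["text"] for m in data
            if m.get("user") in users and "text" in m
        ]
        # Optionally drop the first message, as the cell above does
        queries_by_file[filename] = messages[1:] if skip_first else messages
    return queries_by_file
```

The similarity step could then take `list(load_user_queries(directory, users).values())` as its list of lists instead of naming `list1`, `list2`, and so on by hand.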

# Calculate the similarity between the sentences of each user


The code compares the lists position by position: the k-th query of one user is paired with the k-th query of the other. Because the lists can differ in length, the shorter ones are padded with a default value (0) up to the length of the longest list.

The max_len variable is set to the length of the longest list. Inside the pairwise loop, any list with fewer than max_len elements is padded with the default value (0) using the extend method, so the inner loop can iterate over the first max_len positions of both lists. Note that extend modifies the original lists in place.

Padding the shorter lists affects the cosine similarity scores. At a padded position, a real query is compared against the placeholder rather than against another query, so the score at that position says nothing about how similar the two users' queries are, and it may artificially lower any aggregate computed over the Score column.

In [80]:
from sentence_transformers import SentenceTransformer, util
import pandas as pd

# set pandas options to display all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# initialize the SentenceTransformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# list of lists to compare
lists = [list1, list2]          # add the lists created by the previous cell here

# get the length of the longest list
max_len = max(len(l) for l in lists)

# create a list of pairs and scores
pairs = []
scores = []

# iterate through each pair of lists
for i in range(len(lists)-1):
    for j in range(i+1, len(lists)):
        
        # pad the shorter list with default value (0)
        if len(lists[i]) < max_len:
            lists[i].extend([0] * (max_len - len(lists[i])))
        if len(lists[j]) < max_len:
            lists[j].extend([0] * (max_len - len(lists[j])))
        
        # compute the embeddings for the two lists
        embeddings1 = model.encode(lists[i], convert_to_tensor=True)
        embeddings2 = model.encode(lists[j], convert_to_tensor=True)
        
        # compute cosine similarities between the embeddings
        cosine_scores = util.cos_sim(embeddings1, embeddings2)
        
        # keep only the diagonal of the score matrix: the similarity
        # between the k-th item of one list and the k-th item of the other
        for k in range(max_len):
            pairs.append((lists[i][k], lists[j][k]))
            scores.append(cosine_scores[k][k])

# create a pandas DataFrame with the pairs and scores
df = pd.DataFrame({'Pair': pairs, 'Score': scores})

# display the DataFrame
print(df)



                                                Pair           Score
0  (I would like to buy a new phone. What are the...  tensor(0.7079)
1  (My operating system of choice for a new phone...  tensor(0.4574)
2  (What are the prices of the phones you describ...  tensor(0.4064)
3  (Based on the previous requirements for batter...  tensor(0.2780)
4  (Out of these options, which one has the large...  tensor(0.2622)
5  (What are their additional features? What does...  tensor(0.0161)
6  (Finally, could you give me a top 10 of the be...  tensor(0.0028)
7  (Thank you, I am satisfied with questions rega...  tensor(0.0323)
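
One way to sidestep the padding caveat described before the similarity cell is to pair the queries position by position and leave the score undefined wherever one user has no query. The sketch below is only an illustration under that assumption. The names `align_queries`, `score_aligned` and `score_fn` are hypothetical, and `score_fn` stands for any function that returns a similarity for two sentences.

```python
import math
from itertools import zip_longest

def align_queries(list_a, list_b):
    """Pair the k-th items of both lists; a missing item becomes None."""
    # zip_longest builds new tuples, so the input lists are not modified
    return list(zip_longest(list_a, list_b, fillvalue=None))

def score_aligned(pairs, score_fn):
    """Score each complete pair; incomplete pairs get NaN instead of a score."""
    scores = []
    for a, b in pairs:
        if a is None or b is None:
            scores.append(math.nan)
        else:
            scores.append(float(score_fn(a, b)))
    return scores
```

With the model loaded in the cell above, `score_fn` could for instance encode both sentences and return `util.cos_sim(...).item()`. Because the missing positions are NaN rather than low scores, pandas aggregations such as `df['Score'].mean()` skip them by default.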
