# **Generating the Word Embeddings for users across the askscience subreddit**

This notebook considers one subreddit, askscience. It generates an embedding for every comment and submission, computes the cosine similarity between each comment and the post it replies to, and then averages these similarities for each user pair to measure how similar their language use is.

This notebook uses Sentence-BERT (SBERT).


---

**Part 1: Reading the data**<br>
**Part 2: Generate the word embeddings** <br>
In this section, I have read the entire subreddit. Generate a word embedding for every comment, and store it in the 'data_word_embeddings' dataframe<br>
**Part 3: Find Cultural Similarity between user pairs**<br>
  For every user pair, find the average cosine similarity between their word embeddings<br>

---

OUTPUT FILES:<br>
1. 'data_askscience_comment_level_culsim.csv': contains the cultural similarity with comments which have a valid parent (thus user pairs are repeated)<br>

2. 'data_askscience_user_pair_level_culsim.csv': contains the average cultural similarity for each unique user pair (the mean of the first output file across user pairs)



---
# **Part 1: Reading the data**

In this section, I have read the askscience subreddit.

---

Check if cuda is being used

In [1]:
import torch
if torch.cuda.is_available():
    device_name = torch.device("cuda")
else:
    device_name = torch.device('cpu')
print("Using {}.".format(device_name))

Using cuda.


Connect to drive

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


We already processed the data into a json file in the 'Node_Embeddings_For_askscience_Subreddits.ipynb' file. So we directly read the processed file. Ignore this if you already combined submissions and comments.

In [None]:
import pandas as pd
data_askscience = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_askscience.csv', low_memory=False)
print(len(data_askscience)) #length of data = 26605
print(len(pd.unique(data_askscience['subreddit']))) #number of subreddits considered = 1
print(len(pd.unique(data_askscience['id']))) #unique number of comments = 26605 #the data is at the comment level
print(len(pd.unique(data_askscience['parent_id']))) #number of parent nodes = 10538
print(len(pd.unique(data_askscience['link_id']))) #number of submissions = 3004
print(len(pd.unique(data_askscience['author']))) #number of authors = 6629
print(len(data_askscience.columns)) # = 12

26605
1
26605
10538
3004
6629
12


In [4]:
print(pd.unique(data_askscience['subreddit']))

['askscience']


there is only one subreddit

note: the 'id' column seems to be the comment id, whereas the 'parent_id' column seems to be a link to the parent comment. In this data the parent id is not preceded by the 3-character type prefix (such as 't1_') that Reddit normally adds
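
If another dump does carry the prefix, the ids can be normalized before matching parents to comments. A minimal sketch, assuming the standard Reddit type prefixes of the form `t<digit>_` (for example `t1_` for comments and `t3_` for submissions); the helper name `strip_fullname_prefix` is hypothetical and not part of this notebook:

```python
def strip_fullname_prefix(reddit_id):
    """Return the bare id, dropping a leading 't<digit>_' type prefix if present."""
    if (
        isinstance(reddit_id, str)
        and len(reddit_id) > 3
        and reddit_id[0] == "t"
        and reddit_id[1].isdigit()
        and reddit_id[2] == "_"
    ):
        return reddit_id[3:]
    # Anything else (already bare ids, numeric placeholders) is returned unchanged
    return reddit_id


print(strip_fullname_prefix("t1_iqkee0k"))
print(strip_fullname_prefix("iqkee0k"))
```

Both calls print the same bare id, so the helper is safe to apply whether or not the prefix is present.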

need to remove the rows where the body or the author is '[removed]' or '[deleted]'

In [None]:
data_askscience.loc[data_askscience.author == '[deleted]', 'author'].count()

13334

In [None]:
data_askscience.loc[data_askscience.author == '[removed]', 'author'].count()

0

In [None]:
data_askscience.loc[data_askscience.body == '[deleted]', 'body'].count()

369

In [None]:
data_askscience.loc[data_askscience.body == '[removed]', 'body'].count()

12926

these need to be removed

In [None]:
data_askscience = data_askscience[data_askscience['body'] != '[removed]']
data_askscience = data_askscience[data_askscience['body'] != '[deleted]']
print(len(data_askscience))
print(len(pd.unique(data_askscience['author'])))
print(data_askscience.head(3))

13310
6628
        id   subreddit                                               body  \
0  iqker6l  askscience  No it does not imply that. “We don’t yet know”...   
1  iqkewq0  askscience  while insect muscle might be similar to ours s...   
3  iqkfl8j  askscience  Pasteurization works by heating (generally a l...   

                author  score  gilded  created_utc parent_id link_id  \
0          omniskeptic      2       0   1664582942   iqkee0k  xs73nx   
1  regular_modern_girl    452       0   1664583016   iqjssf5  xs9pjy   
3       jeweledjuniper     11       0   1664583360   iqke0xc  xs1k1y   

   retrieved_on  controversiality  is_submitter  
0    1664960533                 0         False  
1    1664960528                 0         False  
3    1664960508                 0         False  


In [None]:
data_askscience = data_askscience[data_askscience['author'] != '[removed]']
data_askscience = data_askscience[data_askscience['author'] != '[deleted]']
print(len(data_askscience))
print(len(pd.unique(data_askscience['author'])))
print(data_askscience.head(3))

13270
6627
        id   subreddit                                               body  \
0  iqker6l  askscience  No it does not imply that. “We don’t yet know”...   
1  iqkewq0  askscience  while insect muscle might be similar to ours s...   
3  iqkfl8j  askscience  Pasteurization works by heating (generally a l...   

                author  score  gilded  created_utc parent_id link_id  \
0          omniskeptic      2       0   1664582942   iqkee0k  xs73nx   
1  regular_modern_girl    452       0   1664583016   iqjssf5  xs9pjy   
3       jeweledjuniper     11       0   1664583360   iqke0xc  xs1k1y   

   retrieved_on  controversiality  is_submitter  
0    1664960533                 0         False  
1    1664960528                 0         False  
3    1664960508                 0         False  


confirm that there are no more of the incorrect rows

In [None]:
print(data_askscience.loc[data_askscience.author == '[deleted]', 'author'].count())
print(data_askscience.loc[data_askscience.author == '[removed]', 'author'].count())
print(data_askscience.loc[data_askscience.body == '[deleted]', 'body'].count())
print(data_askscience.loc[data_askscience.body == '[removed]', 'body'].count())

0
0
0
0
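
The four separate filters above can also be expressed as a single boolean mask. The following is only a sketch on a toy dataframe; the helper name `drop_placeholder_rows` and the toy rows are made up for illustration, while the column names match the data used here:

```python
import pandas as pd

PLACEHOLDERS = ["[deleted]", "[removed]"]


def drop_placeholder_rows(df):
    """Keep only rows whose author and body are both real values."""
    bad = df["author"].isin(PLACEHOLDERS) | df["body"].isin(PLACEHOLDERS)
    # .copy() avoids chained-assignment warnings if columns are added later
    return df[~bad].copy()


# Toy data (hypothetical rows, not taken from the subreddit)
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "author": ["alice", "[deleted]", "bob", "carol"],
    "body": ["hello", "some text", "[removed]", "world"],
})
cleaned = drop_placeholder_rows(toy)
print(cleaned["id"].tolist())
```

Only rows `a` and `d` survive, because `b` has a deleted author and `c` has a removed body.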


Continue from here if you have already executed 'Submissions_Processing_askscience.ipynb' and have a processed file called 'data_merged_askscience.csv'. <br>

Ignore below if you have been processing comments using the above code in Part 1 and did not run 'Submissions_Processing_askscience.ipynb'.

In [6]:
import pandas as pd
data_askscience = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_merged_askscience.csv', low_memory=False)
print(len(data_askscience)) #length of data = 18949
print(len(pd.unique(data_askscience['subreddit']))) #number of subreddits considered = 1
print(len(pd.unique(data_askscience['id']))) #unique number of ids = 18949 #the data is at the comment/submission level
print(len(pd.unique(data_askscience['parent_id']))) #number of parent nodes = 7790
print(len(pd.unique(data_askscience['link_id']))) #number of unique link_id values = 2867
print(len(pd.unique(data_askscience['author']))) #number of authors = 10895
print(len(data_askscience.columns)) #number of columns = 17

18949
1
18949
7790
2867
10895
17


In [7]:
data_askscience.head(3)

Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,retreived_on,permalink,num_comments,url,self_text,is_self,parent_id,link_id,controversiality,is_submitter
0,xsjqzy,askscience,Why do I poop after a glass or two of beer or ...,depressedchiq,1,0,1664591533,1665426496,/r/askscience/comments/xsjqzy/why_do_i_poop_af...,0,https://www.reddit.com/r/askscience/comments/x...,[removed],True,0,0,novalue,True
1,xsju1p,askscience,If you heated water under immense pressure so ...,zhongliabuse,1,0,1664591793,1665426492,/r/askscience/comments/xsju1p/if_you_heated_wa...,1,https://www.reddit.com/r/askscience/comments/x...,[removed],True,0,0,novalue,True
2,xsjx0l,askscience,What is the evolutionary goal of nose elongati...,_ozeki,1,0,1664592050,1665426489,/r/askscience/comments/xsjx0l/what_is_the_evol...,0,https://www.reddit.com/r/askscience/comments/x...,[removed],True,0,0,novalue,True


In [8]:
data_askscience.tail(3)

Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,retreived_on,permalink,num_comments,url,self_text,is_self,parent_id,link_id,controversiality,is_submitter
18946,iuk9rbq,askscience,I dont understand why you think a local geolog...,tricksterwolf,5,0,1667259694,1667844629,0,0,0,0,0,iuj9y8f,yiizwf,0,False
18947,iuk9uu7,askscience,&gt;but sometimes bacteria wait until they hav...,tedivm,4,0,1667259740,1667844624,0,0,0,0,0,iujz6c0,yia9a5,0,False
18948,iuk9xwj,askscience,Not quite answering the question but fever is ...,lost_in_antartica,2,0,1667259779,1667844621,0,0,0,0,0,yi3t9o,yi3t9o,0,False


how many rows are comments

In [8]:
data_askscience.loc[data_askscience.is_submitter == False, 'is_submitter'].count()

how many rows are submissions

In [9]:
data_askscience.loc[data_askscience.is_submitter == True, 'is_submitter'].count()

6031

12918 comments + 6031 submissions = 18949 rows

---
# **Part 2: Generate the word embeddings**

In this section, I have finished reading the askscience subreddit. Now, generate a word embedding for every comment, and store it in the 'data_word_embeddings' dataframe

---

Import libraries

In [10]:
!pip install -U sentence-transformers
# load tqdm
!pip install --force https://github.com/chengs/tqdm/archive/colab.zip
!pip install swifter

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125923 sha256=8bfc9776b0630b14ee3c17c6d05ad033a420a3a13c9c0502abbc9dcf6ed9b9c1
  Stored in directory: /root/.cache/pip/wheels/62/f2/10/1e606fd5f02395388f74e7462910fe851042f97238cbbd902f
Successfully built sentence-tr

Collecting swifter
  Downloading swifter-1.4.0.tar.gz (1.2 MB)
  Preparing metadata (setup.py) ... done
Collecting tqdm>=4.33.0 (from swifter)
  Downloading tqdm-4.66.1-py3-none-any.whl (78 kB)
Building wheels for collected packages: swifter
  Building wheel for swifter (setup.py) ... done
  Created wheel for swifter: filename=swifter-1.4.0-py3-none-any.whl size=16506 sha256=ca4287de1c1ee11336434b9d541a92ba1ffe99b41e90d2d7587c7ee3e8107537
  Stored in directory: /root/.cache/pip/wheels/e4/cf/51/0904952972ee2c7aa3709437065278dc534ec1b8d2ad41b443
Successfully built swifter
Installing collected packages: tqdm, swifter
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.28.1
    Uninstalling tqdm-4.28.1:

In [11]:
import numpy as np
import pandas as pd
import nltk
from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample, losses, util
from torch.utils.data import DataLoader
from sentence_transformers import losses
import os
import swifter
from nltk.tokenize import sent_tokenize
import torch
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [12]:
print(len(data_askscience))
data_askscience.head(3)

18949


Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,retreived_on,permalink,num_comments,url,self_text,is_self,parent_id,link_id,controversiality,is_submitter
0,xsjqzy,askscience,Why do I poop after a glass or two of beer or ...,depressedchiq,1,0,1664591533,1665426496,/r/askscience/comments/xsjqzy/why_do_i_poop_af...,0,https://www.reddit.com/r/askscience/comments/x...,[removed],True,0,0,novalue,True
1,xsju1p,askscience,If you heated water under immense pressure so ...,zhongliabuse,1,0,1664591793,1665426492,/r/askscience/comments/xsju1p/if_you_heated_wa...,1,https://www.reddit.com/r/askscience/comments/x...,[removed],True,0,0,novalue,True
2,xsjx0l,askscience,What is the evolutionary goal of nose elongati...,_ozeki,1,0,1664592050,1665426489,/r/askscience/comments/xsjx0l/what_is_the_evol...,0,https://www.reddit.com/r/askscience/comments/x...,[removed],True,0,0,novalue,True


clean up the data

First take only the 'id', 'author' and 'body' columns

In [13]:
data = data_askscience
data_word_embeddings = data[['id','author','body']].copy() #copy so that adding the embedding column later does not modify a view of 'data'
print(len(data_word_embeddings)) #length of data = 18949
print(len(pd.unique(data_word_embeddings['id'])))
print(len(pd.unique(data_word_embeddings['author']))) #number of authors = 10895
print(data_word_embeddings.columns) #only three columns

18949
18949
10895
Index(['id', 'author', 'body'], dtype='object')


drop comments with missing body

In [14]:
data_word_embeddings = data_word_embeddings.dropna(subset=['body'])
print(len(data_word_embeddings))
data_word_embeddings.head(3)

18949


Unnamed: 0,id,author,body
0,xsjqzy,depressedchiq,Why do I poop after a glass or two of beer or ...
1,xsju1p,zhongliabuse,If you heated water under immense pressure so ...
2,xsjx0l,_ozeki,What is the evolutionary goal of nose elongati...


so there were no comments with missing body, as the length has remained the same

delete those comments which have body as '[deleted]'. We see that there are 0 such rows.

In [15]:
data_word_embeddings[data_word_embeddings['body'] == '[deleted]']

Unnamed: 0,id,author,body


using a pretrained SBERT model to encode the sentences rather than training on the reddit data itself

In [16]:
from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('all-mpnet-base-v2')


took 5 min on a V100 GPU

In [17]:
embeddings_list = []
j = 0

for ind, row in data_word_embeddings.iterrows():
  j += 1
  if j % 5000 == 0:
    print('created embedding for '+str(ind)+'/'+str(len(data_word_embeddings))+' comments')

  curr_comment = row['body']

  sentence_embedding = sbert_model.encode(curr_comment)
  embeddings_list.append(sentence_embedding)

data_word_embeddings['word_embedding'] = embeddings_list

print(len(data_word_embeddings))
print(data_word_embeddings.head(3))

created embedding for 4999/18949 comments
created embedding for 9999/18949 comments
created embedding for 14999/18949 comments
18949
       id         author                                               body  \
0  xsjqzy  depressedchiq  Why do I poop after a glass or two of beer or ...   
1  xsju1p   zhongliabuse  If you heated water under immense pressure so ...   
2  xsjx0l         _ozeki  What is the evolutionary goal of nose elongati...   

                                      word_embedding  
0  [-0.01071637, 0.022951962, 0.010885665, -0.062...  
1  [-0.041385975, -0.0071237665, 0.0037062431, -0...  
2  [-0.0027840026, -0.019852495, -0.041550636, -0...  
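
Encoding one comment per call, as in the loop above, works but leaves the GPU mostly idle. Sentence encoders can usually take a whole list of texts at once (`SentenceTransformer.encode` accepts a list of sentences and a `batch_size` argument), so the same result can be obtained in chunks. The sketch below shows only the batching pattern: `embed_in_batches` and `toy_encode` are hypothetical names, and `toy_encode` is a stand-in for a real encoder so that the example runs without downloading a model.

```python
import numpy as np


def embed_in_batches(texts, encode_fn, batch_size=64):
    """Encode texts in chunks and stack the results into one (n, dim) array.

    Assumes texts is a non-empty list and encode_fn maps a list of strings
    to a sequence of equal-length vectors.
    """
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(encode_fn(batch)))
    return np.vstack(chunks)


def toy_encode(batch):
    # Hypothetical stand-in for a sentence encoder: [character count, word count]
    return [[float(len(t)), float(len(t.split()))] for t in batch]


comments = ["Why is the sky blue?", "Rayleigh scattering.", "Thanks!"]
embeddings = embed_in_batches(comments, toy_encode, batch_size=2)
print(embeddings.shape)
```

With a real model, passing something like `sbert_model.encode` as `encode_fn` should follow the same pattern, and the rows of the returned array line up with the order of the input list.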


In [18]:
#print the length of the first row embedding
print(data_word_embeddings['word_embedding'][0])
print(len(data_word_embeddings['word_embedding'][0])) #length is 768

[-1.07163703e-02  2.29519624e-02  1.08856652e-02 -6.23611622e-02
 -5.89358060e-05  7.17235580e-02 -1.16350511e-02 -3.38213220e-02
  4.56495862e-03  2.14390159e-02  3.12782861e-02 -6.72612190e-02
 -4.18186374e-02  1.74272582e-02  5.12304567e-02  8.91139433e-02
 -1.16402814e-02  3.29305902e-02  6.15883991e-03 -2.83085089e-02
 -1.24764023e-02 -1.80168953e-02 -3.28141823e-02  8.92432407e-03
  9.23664588e-03 -4.22556251e-02 -9.73174907e-03 -2.30077896e-02
 -6.78109704e-03 -9.23010334e-03  1.01532806e-02  3.40554304e-02
 -3.36482078e-02  6.43406287e-02  8.58616147e-07 -3.62515333e-03
 -2.99272016e-02  3.67333777e-02 -3.33294533e-02  5.90517223e-02
 -1.94965936e-02  6.17460683e-02  1.98648777e-02 -5.93526871e-04
  1.50808543e-02  2.22576540e-02  4.91396151e-02  5.45625687e-02
  4.42653969e-02  4.59404057e-03 -7.17096590e-03 -5.02918549e-02
 -5.50887473e-02 -7.01401606e-02  1.10881180e-01  3.09732221e-02
  1.07591075e-03  1.11946941e-03 -1.03513943e-02 -1.40975406e-02
  3.70865092e-02 -4.09221

---
# **Part 3: Find Cultural Similarity between user pairs**

For every user pair, find the **average** cosine similarity between their word embeddings.


---

In [19]:
#cul_sim_results = pd.DataFrame(columns=['subreddit_id','id','from_user','to_user','cultural_similarity']) #ignore the 'subreddit_id' as there is only one subreddit
cul_sim_results = pd.DataFrame(columns=['id','from_user','to_user','cultural_similarity'])
print(cul_sim_results)

Empty DataFrame
Columns: [id, from_user, to_user, cultural_similarity]
Index: []


function to find cosine similarity

In [20]:
import numpy as np
def cosine_sim(vector1, vector2):
    return min(1., np.dot(vector1, vector2) / (np.linalg.norm(vector1, ord=2) * np.linalg.norm(vector2, ord=2)))
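
For reference, the cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms, so it lies in [-1, 1] up to floating-point error. The function above clips only the upper end and would divide by zero for an all-zero vector. The variant below is a sketch with a hypothetical name (`cosine_similarity`) that clips both ends and returns NaN for zero vectors; it is not the function used in the rest of this notebook.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between u and v, clipped to [-1, 1]; NaN for zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return float("nan")
    return float(np.clip(np.dot(u, v) / denom, -1.0, 1.0))


print(cosine_similarity([1, 0], [1, 0]))   # same direction
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal
print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction
```

The three calls cover the extreme cases: identical direction, orthogonal vectors, and opposite direction.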

In [21]:
def cultural_similarity_function(input_data):

  ignore_comments_counter = 0
  j = 0
  no_submissions = 0

  #additional code to resolve an error
  type_base = type(input_data['parent_id'].iloc[0])

  for ind, row in input_data.iterrows():
    j += 1
    if j % 10000 == 0:
      print('finished comment '+str(j)+'/'+str(len(input_data)))

    curr_comment_id = row['id']
    curr_author = row['author']
    #curr_subreddit_id = row['subreddit_id']
    curr_id = row['id']
    if type(row['parent_id']) != type_base:
      ignore_comments_counter += 1 #count rows whose parent_id has an unexpected type
      continue

    if row['is_submitter'] == True: #the row is a submission
      no_submissions += 1
      continue

    #curr_parent_comment_id = row['parent_id'][3:] #noticed that the parent id is nothing but the comment id preceded by 3 characters
    curr_parent_comment_id = row['parent_id'] #in this data there are no preceding 3 characters for parent_id. It is directly a comment_id

    #find the parent comment
    #identify if a comment is a primary comment
    primary_comment_flag = 0
    if row['parent_id'] == row['link_id']: #it is a primary comment
      #print("primary comment")
      primary_comment_flag = 1
    if(len(input_data[input_data['id'] == curr_parent_comment_id]['author']) == 0): #the parent comment could not be found
      ignore_comments_counter += 1
      continue
    primary_comment_flag = 0
    curr_parent = input_data[input_data['id'] == curr_parent_comment_id]['author'].values[0]

    #find the word embeddings of sender and receiver and hence cosine similarity
    if len(data_word_embeddings[data_word_embeddings['id'] == curr_comment_id]['word_embedding'].values) == 0: #could not find the embedding of the sender
      ignore_comments_counter += 1
      continue
    else:
      from_user_embedding = data_word_embeddings[data_word_embeddings['id'] == curr_comment_id]['word_embedding'].values[0]

    if len(data_word_embeddings[data_word_embeddings['id'] == curr_parent_comment_id]['word_embedding'].values) == 0: #could not find the embedding of the receiver
      ignore_comments_counter += 1
      continue
    else:
      to_user_embedding = data_word_embeddings[data_word_embeddings['id'] == curr_parent_comment_id]['word_embedding'].values[0]

    if j == 0:
      print(from_user_embedding.shape)
      print(to_user_embedding.shape)
    cultsim = cosine_sim(from_user_embedding, to_user_embedding)

    #cul_sim_results.loc[len(cul_sim_results.index)] = [curr_subreddit_id, curr_id, curr_author, curr_parent, cultsim] we dont have a subreddit_id
    cul_sim_results.loc[len(cul_sim_results.index)] = [curr_id, curr_author, curr_parent, cultsim]

  print('total number of comments ignored: ' +str(ignore_comments_counter))
  print('no of submissions = '+str(no_submissions))
  return cul_sim_results

data_culsim_user_pairs = cultural_similarity_function(data)

finished comment 10000/18949
total number of comments ignored: 1136
no of submissions = 6031
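
The function above filters the whole dataframe several times per row to find the parent's author and the two embeddings. On larger subreddits it may be worth building id-keyed dictionaries once and doing constant-time lookups instead. The following is a simplified sketch on toy data, not a drop-in replacement: the names `pair_similarities`, `rows` and `embeddings` are hypothetical, and unlike the function above it counts submissions together with the other skipped rows.

```python
import math


def cosine(u, v):
    """Plain-Python cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pair_similarities(rows, embeddings):
    """rows: dicts with id, author, parent_id; embeddings: id -> vector."""
    author_by_id = {r["id"]: r["author"] for r in rows}  # built once
    results, skipped = [], 0
    for r in rows:
        parent = r["parent_id"]
        # Skip rows whose parent or embeddings cannot be found
        if parent not in author_by_id or parent not in embeddings or r["id"] not in embeddings:
            skipped += 1
            continue
        sim = cosine(embeddings[r["id"]], embeddings[parent])
        results.append((r["id"], r["author"], author_by_id[parent], sim))
    return results, skipped


# Toy thread (hypothetical ids and vectors)
rows = [
    {"id": "s1", "author": "op", "parent_id": "0"},       # submission, no parent
    {"id": "c1", "author": "alice", "parent_id": "s1"},   # reply to the submission
    {"id": "c2", "author": "bob", "parent_id": "c1"},     # reply to alice
    {"id": "c3", "author": "carol", "parent_id": "zz"},   # parent missing from data
]
embeddings = {"s1": [1.0, 0.0], "c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [1.0, 1.0]}
results, skipped = pair_similarities(rows, embeddings)
print(results)
print(skipped)
```

Two rows are skipped in the toy thread: the submission itself and the comment whose parent is not in the data.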


At this stage, we have the cultural similarity for every comment (the cosine similarity between the embedding of the sender's comment and the embedding of the receiver's comment). However, we actually want the *average* cosine similarity for each user pair, so we now take the mean across all comments exchanged within a pair. (We will then feed this back into the comment data set, so that each comment is associated with one average value.) The average matters because, for each comment, we want the typical similarity between sender and receiver, not just the similarity of that one exchange.

In [22]:
print(len(data_culsim_user_pairs))
print(data_culsim_user_pairs.head(3))

11782
        id       from_user      to_user  cultural_similarity
0  iqker6l     omniskeptic       chop1n             0.432118
1  iqkfl8j  jeweledjuniper    feitingen             0.642043
2  iqkfmj9          chop1n  omniskeptic             0.421561


confirm that user pairs are repeated

In [23]:
data_culsim_user_pairs.duplicated(subset=['from_user', 'to_user']).sum()

682

save the comment-level cultural similarity (one row per comment, so user pairs can repeat) as a csv

In [24]:
data_culsim_user_pairs.to_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_askscience_comment_level_culsim.csv')

In [25]:
data_culsim_user_pairs = data_culsim_user_pairs.groupby(['from_user','to_user'], as_index=False)['cultural_similarity'].mean()

In [26]:
print(len(data_culsim_user_pairs))
print(data_culsim_user_pairs.head(3))

11100
      from_user         to_user  cultural_similarity
0     --tenet--   automoderator             0.118791
1  -1kingkrool-        web-dude             0.604204
2      -banned-  stimulatedecho             0.198554


this dataset has 11100 user pairs (the cultural similarity was found for 11782 comments, however there were 682 repeated user pairs in that)

Confirm that there are no repeated user pairs

In [27]:
data_culsim_user_pairs.duplicated(subset=['from_user', 'to_user']).sum()

0

In [28]:
data_culsim_user_pairs.to_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_askscience_user_pair_level_culsim.csv')
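
To attach the pair-level average back onto each comment, as described earlier, one option is a left merge on the (from_user, to_user) pair. The sketch below uses a toy dataframe with made-up values; the column name `avg_cultural_similarity` is an assumption introduced here, not a name used elsewhere in this notebook. Note that pairs are directed: (alice, bob) and (bob, alice) are averaged separately.

```python
import pandas as pd

# Toy comment-level results (hypothetical values)
comment_level = pd.DataFrame({
    "id": ["c1", "c2", "c3"],
    "from_user": ["alice", "alice", "bob"],
    "to_user": ["bob", "bob", "alice"],
    "cultural_similarity": [0.25, 0.75, 0.125],
})

# Mean similarity per directed user pair
pair_level = (
    comment_level
    .groupby(["from_user", "to_user"], as_index=False)["cultural_similarity"]
    .mean()
    .rename(columns={"cultural_similarity": "avg_cultural_similarity"})
)

# Left merge keeps one row per comment and adds the pair average
merged = comment_level.merge(pair_level, on=["from_user", "to_user"], how="left")
print(merged[["id", "cultural_similarity", "avg_cultural_similarity"]])
```

Each comment keeps its own similarity and gains the average for its sender and receiver, so both alice-to-bob comments share the same average.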