# **Generating the Word Embeddings for users across the askscience subreddit**

This notebook considers one subreddit: askscience. For each user in the subreddit, it calculates the mean word embedding from all of their comments. It then computes the cosine similarity of language use between user pairs.

This notebook uses Sentence-BERT (SBERT).


---

**Part 1: Reading the data**<br>
**Part 2: Generate the word embeddings** <br>
At this point the entire subreddit has been read. This part generates a word embedding for every comment and stores it in the 'data_word_embeddings' dataframe<br>
**Part 3: Find Cultural Similarity between user pairs**<br>
  For every user pair, find the average cosine similarity between their word embeddings<br>

---

OUTPUT FILES:<br>
1. 'data_askscience_comment_level_culsim.csv': contains the cultural similarity with comments which have a valid parent (thus user pairs are repeated)<br>

2. 'data_askscience_user_pair_level_culsim.csv': contains the average cultural similarity for each unique user pair (the mean of the first output file, taken across user pairs, gives the average cosine similarity)



---
# **Part 1: Reading the data**

This section reads the askscience subreddit data.

---

Check whether CUDA is available

In [None]:
import torch
if torch.cuda.is_available():
    device_name = torch.device("cuda")
else:
    device_name = torch.device('cpu')
print("Using {}.".format(device_name))

Using cuda.


Connect to drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


We already processed the data in the 'Node_Embeddings_For_askscience_Subreddits.ipynb' notebook, so we read the processed CSV file directly.

In [None]:
import pandas as pd
data_askscience = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_askscience.csv', low_memory=False)
print(len(data_askscience)) #length of data = 26605
print(len(pd.unique(data_askscience['subreddit']))) #number of subreddits considered = 1
print(len(pd.unique(data_askscience['id']))) #unique number of comments = 26605 #the data is at the comment level
print(len(pd.unique(data_askscience['parent_id']))) #number of parent nodes = 10538
print(len(pd.unique(data_askscience['link_id']))) #number of submissions = 3004
print(len(pd.unique(data_askscience['author']))) #number of authors = 6629
print(len(data_askscience.columns)) # = 12

26605
1
26605
10538
3004
6629
12


In [None]:
print(pd.unique(data_askscience['subreddit']))

['askscience']


there is only one subreddit

In [None]:
data_askscience.head(10)

Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,parent_id,link_id,retrieved_on,controversiality,is_submitter
0,iqker6l,askscience,No it does not imply that. “We don’t yet know”...,omniskeptic,2,0,1664582942,iqkee0k,xs73nx,1664960533,0,False
1,iqkewq0,askscience,while insect muscle might be similar to ours s...,regular_modern_girl,452,0,1664583016,iqjssf5,xs9pjy,1664960528,0,False
2,iqkfdmz,askscience,[removed],[deleted],1,0,1664583252,iqkb49u,xs9pjy,1664960514,0,False
3,iqkfl8j,askscience,Pasteurization works by heating (generally a l...,jeweledjuniper,11,0,1664583360,iqke0xc,xs1k1y,1664960508,0,False
4,iqkfmj9,askscience,"It *absolutely* implies an expectation, even i...",chop1n,3,0,1664583378,iqker6l,xs73nx,1664960507,0,False
5,iqkfrm5,askscience,"PhD in yeast genetics here, so I’ve streaked t...",smallwhitedog,4,0,1664583450,xs1k1y,xs1k1y,1664960502,0,False
6,iqkfsgy,askscience,Others have given great reasons for why our si...,moewind420,6,0,1664583462,xs73nx,xs73nx,1664960501,0,False
7,iqkft4v,askscience,[removed],[deleted],1,0,1664583472,xs1k1y,xs1k1y,1664960501,0,False
8,iqkfzvn,askscience,Inside a living human isn’t lightless dark. Li...,sovietamerican,7,0,1664583564,iqk1nsq,xs4rhf,1664960495,0,False
9,iqkg3r0,askscience,Wordy is good. your explanation is helping me...,tonytoews,2,0,1664583615,iqk8u6o,xs73nx,1664960492,0,False


Note: the 'id' column seems to be the comment id, whereas the 'parent_id' column seems to be a link to the parent comment. In this data the parent id is not preceded by the 3-character prefix.
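
In raw Reddit dumps, `parent_id` is usually a "fullname" that carries a type prefix such as `t1_` (comment) or `t3_` (submission). The data here already has that prefix removed, but a small helper can make the parent lookup tolerate both forms. This is only a sketch; the helper name `strip_fullname_prefix` is hypothetical and not part of this notebook.

```python
import re

# Reddit fullnames prefix an id with its type, e.g. "t1_" for comments
# and "t3_" for submissions. Bare ids are base-36 and contain no underscore.
_FULLNAME_PREFIX = re.compile(r"^t[1-6]_")


def strip_fullname_prefix(reddit_id: str) -> str:
    """Return the bare id, dropping a leading 't<digit>_' prefix if present."""
    return _FULLNAME_PREFIX.sub("", reddit_id)


# Both forms map to the same bare id.
print(strip_fullname_prefix("t1_iqkee0k"))
print(strip_fullname_prefix("iqkee0k"))
```

Applying such a helper to 'parent_id' and 'link_id' before matching would let the same code run on prefixed and unprefixed exports.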

We need to remove the rows where the body is '[removed]' or '[deleted]', or where the author is '[deleted]' or '[removed]'.

In [None]:
data_askscience.loc[data_askscience.author == '[deleted]', 'author'].count()

13334

In [None]:
data_askscience.loc[data_askscience.author == '[removed]', 'author'].count()

0

In [None]:
data_askscience.loc[data_askscience.body == '[deleted]', 'body'].count()

369

In [None]:
data_askscience.loc[data_askscience.body == '[removed]', 'body'].count()

12926

these need to be removed

In [None]:
data_askscience = data_askscience[data_askscience['body'] != '[removed]']
data_askscience = data_askscience[data_askscience['body'] != '[deleted]']
print(len(data_askscience))
print(len(pd.unique(data_askscience['author'])))
print(data_askscience.head(3))

13310
6628
        id   subreddit                                               body  \
0  iqker6l  askscience  No it does not imply that. “We don’t yet know”...   
1  iqkewq0  askscience  while insect muscle might be similar to ours s...   
3  iqkfl8j  askscience  Pasteurization works by heating (generally a l...   

                author  score  gilded  created_utc parent_id link_id  \
0          omniskeptic      2       0   1664582942   iqkee0k  xs73nx   
1  regular_modern_girl    452       0   1664583016   iqjssf5  xs9pjy   
3       jeweledjuniper     11       0   1664583360   iqke0xc  xs1k1y   

   retrieved_on  controversiality  is_submitter  
0    1664960533                 0         False  
1    1664960528                 0         False  
3    1664960508                 0         False  


In [None]:
data_askscience = data_askscience[data_askscience['author'] != '[removed]']
data_askscience = data_askscience[data_askscience['author'] != '[deleted]']
print(len(data_askscience))
print(len(pd.unique(data_askscience['author'])))
print(data_askscience.head(3))

13270
6627
        id   subreddit                                               body  \
0  iqker6l  askscience  No it does not imply that. “We don’t yet know”...   
1  iqkewq0  askscience  while insect muscle might be similar to ours s...   
3  iqkfl8j  askscience  Pasteurization works by heating (generally a l...   

                author  score  gilded  created_utc parent_id link_id  \
0          omniskeptic      2       0   1664582942   iqkee0k  xs73nx   
1  regular_modern_girl    452       0   1664583016   iqjssf5  xs9pjy   
3       jeweledjuniper     11       0   1664583360   iqke0xc  xs1k1y   

   retrieved_on  controversiality  is_submitter  
0    1664960533                 0         False  
1    1664960528                 0         False  
3    1664960508                 0         False  


confirm that there are no more of the incorrect rows

In [None]:
print(data_askscience.loc[data_askscience.author == '[deleted]', 'author'].count())
print(data_askscience.loc[data_askscience.author == '[removed]', 'author'].count())
print(data_askscience.loc[data_askscience.body == '[deleted]', 'body'].count())
print(data_askscience.loc[data_askscience.body == '[removed]', 'body'].count())

0
0
0
0


---
# **Part 2: Generate the word embeddings**

With the askscience subreddit read and cleaned, this section generates a word embedding for every comment and stores it in the 'data_word_embeddings' dataframe.

---

Import libraries

In [None]:
!pip install -U sentence-transformers
# load tqdm
!pip install --force https://github.com/chengs/tqdm/archive/colab.zip
!pip install swifter

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_trans

Collecting swifter
  Downloading swifter-1.4.0.tar.gz (1.2 MB)
  Preparing metadata (setup.py) ... done
Collecting tqdm>=4.33.0 (from swifter)
  Downloading tqdm-4.66.1-py3-none-any.whl (78 kB)
Building wheels for collected packages: swifter
  Building wheel for swifter (setup.py) ... done
  Created wheel for swifter: filename=swifter-1.4.0-py3-none-any.whl size=16506 sha256=8976e1238b529a767c4ebf227c5d1c27e805ad1da64de131860f69adf0e0c219
  Stored in directory: /root/.cache/pip/wheels/e4/cf/51/0904952972ee2c7aa3709437065278dc534ec1b8d2ad41b443
Successfully built swifter
Installing collected packages: tqdm, swifter
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.28.1
    Uninstalling tqdm-4.28.1:


In [None]:
import numpy as np
import pandas as pd
import nltk
from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample, losses, util
from torch.utils.data import DataLoader
from sentence_transformers import losses
import os
import swifter
from nltk.tokenize import sent_tokenize
import torch
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
print(len(data_askscience))
data_askscience.head(3)

13270


Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,parent_id,link_id,retrieved_on,controversiality,is_submitter
0,iqker6l,askscience,No it does not imply that. “We don’t yet know”...,omniskeptic,2,0,1664582942,iqkee0k,xs73nx,1664960533,0,False
1,iqkewq0,askscience,while insect muscle might be similar to ours s...,regular_modern_girl,452,0,1664583016,iqjssf5,xs9pjy,1664960528,0,False
3,iqkfl8j,askscience,Pasteurization works by heating (generally a l...,jeweledjuniper,11,0,1664583360,iqke0xc,xs1k1y,1664960508,0,False


clean up the data

First take only the 'id', 'author' and 'body' columns

In [None]:
data = data_askscience
data_word_embeddings = data[['id','author','body']].copy() #copy so that adding a column later does not write to a slice of 'data'
print(len(data_word_embeddings)) #length of data = 13270
print(len(pd.unique(data_word_embeddings['id']))) #unique number of comments = 13270
print(len(pd.unique(data_word_embeddings['author']))) #number of authors = 6627
print(data_word_embeddings.columns) #only three columns

13270
13270
6627
Index(['id', 'author', 'body'], dtype='object')


drop comments with missing body

In [None]:
data_word_embeddings = data_word_embeddings.dropna(subset=['body'])
print(len(data_word_embeddings))
data_word_embeddings.head(3)

13270


Unnamed: 0,id,author,body
0,iqker6l,omniskeptic,No it does not imply that. “We don’t yet know”...
1,iqkewq0,regular_modern_girl,while insect muscle might be similar to ours s...
3,iqkfl8j,jeweledjuniper,Pasteurization works by heating (generally a l...


so there were no comments with missing body, as the length has remained the same

Check for comments whose body is '[deleted]'. There were 369 such rows in the raw data, but they were already removed in Part 1, so the result below is empty.

In [None]:
data_word_embeddings[data_word_embeddings['body'] == '[deleted]']

Unnamed: 0,id,author,body


We use a pretrained SBERT model to encode the comments rather than training on the reddit data itself.

In [None]:
from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('all-mpnet-base-v2')


took 5 min on a V100 GPU

In [None]:
embeddings_list = []
j = 0

for ind, row in data_word_embeddings.iterrows():
  j += 1
  if j % 5000 == 0:
    print('created embedding for '+str(ind)+'/'+str(len(data_word_embeddings))+' comments')

  curr_comment = row['body']

  sentence_embedding = sbert_model.encode(curr_comment)
  embeddings_list.append(sentence_embedding)

data_word_embeddings['word_embedding'] = embeddings_list

print(len(data_word_embeddings))
print(data_word_embeddings.head(3))

created embedding for 11258/13270 comments
created embedding for 20539/13270 comments
13270
        id               author  \
0  iqker6l          omniskeptic   
1  iqkewq0  regular_modern_girl   
3  iqkfl8j       jeweledjuniper   

                                                body  \
0  No it does not imply that. “We don’t yet know”...   
1  while insect muscle might be similar to ours s...   
3  Pasteurization works by heating (generally a l...   

                                      word_embedding  
0  [0.03344682, 0.10473628, -0.043019485, 0.00056...  
1  [-0.040533893, -0.07450384, 0.0006170612, 0.00...  
3  [0.018933902, 0.010380933, -0.01886834, -0.025...  
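
The loop above encodes one comment per call. Encoding in batches is a common way to reduce per-call overhead; the sketch below shows only the batching logic. `encode_in_batches` and `toy_encoder` are hypothetical names, and `toy_encoder` is a stand-in that returns two simple numbers per text, not a real embedding model.

```python
def encode_in_batches(texts, encode_fn, batch_size=64):
    """Encode `texts` in chunks of `batch_size`, preserving input order.

    `encode_fn` must accept a list of strings and return one vector per string.
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(encode_fn(batch))
    return embeddings


def toy_encoder(batch):
    """Stand-in encoder: (character count, space count) for each text."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]


comments = ["No it does not", "Pasteurization works", "ok"]
vectors = encode_in_batches(comments, toy_encoder, batch_size=2)
print(len(vectors), vectors[0])
```

With sentence-transformers, `SentenceTransformer.encode` itself accepts a list of strings and a `batch_size` argument, so passing `data_word_embeddings['body'].tolist()` in a single call would likely be the simplest option in this notebook.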


Note: the progress message prints the dataframe index label rather than a running count. That is why the printed values can go up to around 26000 even though there are only 13270 comments.
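
The difference between an index label and a running position can be seen on a toy example. This sketch assumes a filtered dataset whose labels are no longer contiguous; the variable names are illustrative only.

```python
# Index labels left over after filtering rows: they are no longer contiguous.
index_labels = [0, 1, 3, 8, 9]
comments = ["a", "b", "c", "d", "e"]

progress_messages = []
# enumerate(..., start=1) gives the running position, independent of the label.
for position, (label, body) in enumerate(zip(index_labels, comments), start=1):
    if position % 2 == 0:
        progress_messages.append(
            f"created embedding for {position}/{len(comments)} comments "
            f"(index label {label})"
        )

for message in progress_messages:
    print(message)
```

Reporting the position keeps the progress message within the total count, while the label can still be shown for debugging.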

In [None]:
#print the length of the first row embedding
print(data_word_embeddings['word_embedding'][0])
print(len(data_word_embeddings['word_embedding'][0])) #length is 768

[ 3.34468186e-02  1.04736283e-01 -4.30194847e-02  5.60511486e-04
  1.23505937e-02 -1.55451195e-02 -5.33590093e-02  5.78322820e-02
 -6.23850860e-02 -1.16540147e-02 -3.47376652e-02  2.08749175e-02
  1.22990552e-02  4.14814353e-02  3.59352417e-02 -1.06086144e-02
 -3.51381418e-03 -6.73249215e-02  6.33548424e-02  3.81984897e-02
 -1.03722624e-01 -8.87410901e-03 -1.11974133e-02 -2.60884762e-02
  4.14715968e-02 -3.57148498e-02 -6.52609169e-02  5.88509161e-03
 -3.71300057e-02 -1.74795352e-02 -4.66040187e-02  1.64387524e-02
  1.71704628e-02  8.05278029e-03  1.85813678e-06 -7.50383176e-03
  3.55025493e-02  3.36213782e-02 -6.94407448e-02 -3.87020782e-02
 -2.43431106e-02  6.59425044e-03  3.23935077e-02  1.92979891e-02
 -5.67544214e-02 -1.71654131e-02 -5.27759083e-02 -2.14929357e-02
 -4.07694466e-02  2.47575324e-02  2.18778173e-03  4.92853345e-03
  2.77845487e-02 -1.01895155e-02 -2.47169775e-03  2.85629444e-02
  3.62807959e-02 -4.52011488e-02 -1.50722107e-02  2.00451221e-02
 -4.17127460e-02  6.43573

---
# **Part 3: Find Cultural Similarity between user pairs**

For every user pair, find the **average** cosine similarity between the word embeddings of the comments they exchanged.


---

In [None]:
#cul_sim_results = pd.DataFrame(columns=['subreddit_id','id','from_user','to_user','cultural_similarity']) #ignore the 'subreddit_id' as there is only one subreddit
cul_sim_results = pd.DataFrame(columns=['id','from_user','to_user','cultural_similarity'])
print(cul_sim_results)

Empty DataFrame
Columns: [id, from_user, to_user, cultural_similarity]
Index: []


function to find cosine similarity

In [None]:
import numpy as np
def cosine_sim(vector1, vector2):
    return min(1., np.dot(vector1, vector2) / (np.linalg.norm(vector1, ord=2) * np.linalg.norm(vector2, ord=2)))
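
As a worked check of the formula, cosine similarity is the dot product divided by the product of the two L2 norms. The sketch below is a dependency-free variant, not the function used in this notebook: the name `cosine_sim_safe` is hypothetical, and unlike `cosine_sim` it also clamps at -1 and returns NaN for a zero vector instead of dividing by zero.

```python
import math


def cosine_sim_safe(v1, v2):
    """Cosine similarity of two equal-length vectors, clamped to [-1, 1].

    Returns NaN when either vector has zero length, since the angle is undefined.
    """
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return float("nan")
    # Clamp to absorb floating-point rounding just outside [-1, 1].
    return max(-1.0, min(1.0, dot / (norm1 * norm2)))


print(cosine_sim_safe([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine_sim_safe([1.0, 0.0], [-1.0, 0.0]))  # opposite vectors
```

Parallel vectors such as `[1, 2]` and `[2, 4]` give a value of 1 up to rounding, which is the case the `min(1., ...)` in `cosine_sim` guards against.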

In [None]:
def cultural_similarity_function(input_data):

  ignore_comments_counter = 0
  j = 0

  #additional code to resolve an error
  type_base = type(input_data['parent_id'].iloc[0])

  for ind, row in input_data.iterrows():
    j += 1
    if j % 10000 == 0:
      print('finished comment '+str(j)+'/'+str(len(input_data)))

    curr_comment_id = row['id']
    curr_author = row['author']
    #curr_subreddit_id = row['subreddit_id']
    curr_id = row['id']
    if type(row['parent_id']) != type_base: #parent_id is missing or has an unexpected type (e.g. NaN)
      ignore_comments_counter += 1 #count the skipped comment
      continue
    #curr_parent_comment_id = row['parent_id'][3:] #noticed that the parent id is nothing but the comment id preceded by 3 characters
    curr_parent_comment_id = row['parent_id'] #in this data there are no preceding 3 characters for parent_id. It is directly a comment_id

    #find the parent comment
    #identify if a comment is a primary comment
    primary_comment_flag = 0
    if row['parent_id'] == row['link_id']: #it is a primary comment
      #print("primary comment")
      primary_comment_flag = 1
    if(len(input_data[input_data['id'] == curr_parent_comment_id]['author']) == 0): #the parent comment could not be found
      ignore_comments_counter += 1
      continue
    primary_comment_flag = 0
    curr_parent = input_data[input_data['id'] == curr_parent_comment_id]['author'].values[0]

    #find the word embeddings of sender and receiver and hence cosine similarity
    if len(data_word_embeddings[data_word_embeddings['id'] == curr_comment_id]['word_embedding'].values) == 0: #could not find the embedding of the sender
      ignore_comments_counter += 1
      continue
    else:
      from_user_embedding = data_word_embeddings[data_word_embeddings['id'] == curr_comment_id]['word_embedding'].values[0]

    if len(data_word_embeddings[data_word_embeddings['id'] == curr_parent_comment_id]['word_embedding'].values) == 0: #could not find the embedding of the receiver
      ignore_comments_counter += 1
      continue
    else:
      to_user_embedding = data_word_embeddings[data_word_embeddings['id'] == curr_parent_comment_id]['word_embedding'].values[0]

    if j == 0:
      print(from_user_embedding.shape)
      print(to_user_embedding.shape)
    cultsim = cosine_sim(from_user_embedding, to_user_embedding)

    #cul_sim_results.loc[len(cul_sim_results.index)] = [curr_subreddit_id, curr_id, curr_author, curr_parent, cultsim] we dont have a subreddit_id
    cul_sim_results.loc[len(cul_sim_results.index)] = [curr_id, curr_author, curr_parent, cultsim]

  print('total number of comments ignored: ' +str(ignore_comments_counter))
  return cul_sim_results

data_culsim_user_pairs = cultural_similarity_function(data)

finished comment 10000/13270
total number of comments ignored: 5318
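
The function above scans the dataframe several times for every comment. The same parent lookup can be sketched with a dictionary keyed by comment id, which finds each parent in constant time. This is an illustrative alternative under simplifying assumptions (comments as plain dicts with an `embedding` key, toy two-dimensional vectors); `comment_level_similarity` and `cosine` are hypothetical names, not the notebook's code.

```python
import math


def cosine(v1, v2):
    """Plain cosine similarity for non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))


def comment_level_similarity(comments):
    """Pair each comment with its parent comment and score their embeddings.

    Comments whose parent is not in the data (top-level comments, or parents
    that were filtered out) are counted as ignored.
    """
    by_id = {c["id"]: c for c in comments}  # one pass to build the lookup
    rows, ignored = [], 0
    for c in comments:
        parent = by_id.get(c["parent_id"])
        if parent is None:
            ignored += 1
            continue
        rows.append({
            "id": c["id"],
            "from_user": c["author"],
            "to_user": parent["author"],
            "cultural_similarity": cosine(c["embedding"], parent["embedding"]),
        })
    return rows, ignored


toy_comments = [
    {"id": "a1", "author": "alice", "parent_id": "s1", "embedding": [1.0, 0.0]},
    {"id": "b2", "author": "bob", "parent_id": "a1", "embedding": [1.0, 1.0]},
    {"id": "c3", "author": "carol", "parent_id": "zz", "embedding": [0.0, 1.0]},
]
rows, ignored = comment_level_similarity(toy_comments)
print(rows, ignored)
```

Collecting rows in a list and building the dataframe once at the end also avoids growing `cul_sim_results` one row at a time.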


At this stage, we have the cultural similarity for every comment (the cosine similarity between the embedding of the sender's comment and the embedding of the receiver's comment). However, we actually want the *average* cosine similarity for each user pair, so we now take the mean across all comments exchanged by the same pair. (We will then merge this back into the comment data set, so that each comment is associated with the average cosine similarity of its sender and receiver, not just the similarity of that one comment.)
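
The averaging step can be illustrated on toy data. The sketch below groups comment-level rows by the directed pair (sender, receiver) and takes the mean; the function name and the values are made up for illustration.

```python
from collections import defaultdict


def mean_similarity_per_pair(rows):
    """Average the similarity over all rows sharing the same (from_user, to_user)."""
    totals = defaultdict(lambda: [0.0, 0])  # pair -> [running sum, count]
    for from_user, to_user, similarity in rows:
        acc = totals[(from_user, to_user)]
        acc[0] += similarity
        acc[1] += 1
    return {pair: total / count for pair, (total, count) in totals.items()}


toy_rows = [
    ("alice", "bob", 0.40),
    ("alice", "bob", 0.60),  # repeated pair: averaged with the row above
    ("bob", "alice", 0.50),  # reverse direction is a separate pair
]
pair_means = mean_similarity_per_pair(toy_rows)
print(pair_means)
```

Note that the pair is directed: alice replying to bob and bob replying to alice are kept as two separate entries, which matches grouping on both 'from_user' and 'to_user'.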

In [None]:
print(len(data_culsim_user_pairs))
print(data_culsim_user_pairs.head(3))

7952
        id       from_user      to_user  cultural_similarity
0  iqker6l     omniskeptic       chop1n             0.432118
1  iqkfl8j  jeweledjuniper    feitingen             0.642043
2  iqkfmj9          chop1n  omniskeptic             0.421561


confirm that user pairs are repeated

In [None]:
data_culsim_user_pairs.duplicated(subset=['from_user', 'to_user']).sum()

474

save the user-pair level per comment cultural similarity as a csv

In [None]:
data_culsim_user_pairs.to_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_askscience_comment_level_culsim.csv')

In [None]:
data_culsim_user_pairs = data_culsim_user_pairs.groupby(['from_user','to_user'], as_index=False)['cultural_similarity'].mean()

In [None]:
print(len(data_culsim_user_pairs))
print(data_culsim_user_pairs.head(3))

7478
      from_user         to_user  cultural_similarity
0  -1kingkrool-        web-dude             0.604204
1      -banned-  stimulatedecho             0.198554
2        -domi-       --tenet--             0.538471


this dataset has 7478 user pairs (the cultural similarity was found for 7952 comments, however there were 474 repeated user pairs in that)

Confirm that there are no repeated user pairs

In [None]:
data_culsim_user_pairs.duplicated(subset=['from_user', 'to_user']).sum()

0

In [None]:
data_culsim_user_pairs.to_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_askscience_user_pair_level_culsim.csv')