# **Generating the Word Embeddings for users across Fifteen Subreddits**

This notebook considers 15 subreddits. For each user in a subreddit, it calculates the mean word embedding from all his comments. Then it calculates it generates the cosine similarity of language use between user pairs.

This notebook uses the SentenceBert.


---

**Part 1: Reading the data**<br>
**Part 2: Generate the word embeddings** <br>
  Generate word embedding for every commnent, and then take the mean for every user. This embedding represents his language usage on the reddit platform.<br>
**Part 3: Find Cultural Similarity between user pairs**<br>
  For every user pair, find the cosine similarity between their word embeddings<br>

---

.


.


---
# **Part 1: Reading the data**

In this section, I have read the 15 subreddits.

---
.

Check if cuda is being used

In [1]:
import torch
if torch.cuda.is_available():
    device_name = torch.device("cuda")
else:
    device_name = torch.device('cpu')
print("Using {}.".format(device_name))

Using cuda.


Connect to drive

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
import pandas as pd
data_fifteen_subreddits = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/reddit_project/data_fifteen_subreddits.csv', low_memory=False)
print(len(data_fifteen_subreddits)) #length of data = 107352
print(len(pd.unique(data_fifteen_subreddits['subreddit_id']))) #number of subreddits considered = 16 #but is 15
print(len(pd.unique(data_fifteen_subreddits['id']))) #unique number of comments = , the data is at the comment level =
print(len(pd.unique(data_fifteen_subreddits['parent_id']))) #number of parent nodes =
print(len(pd.unique(data_fifteen_subreddits['link_id']))) #number of submissions =
print(len(data_fifteen_subreddits.columns))
print(pd.unique(data_fifteen_subreddits['subreddit_id'])) #there are fifteen unique subreddits, the nana rows are ignored

107352
16
107351
50325
6156
17
['t5_22i0' 't5_2r2jt' 't5_3deqz' 't5_2sjgc' 't5_2scss' 't5_2r4oc'
 't5_2wm0g' 't5_2qmpb' 't5_2vbli' 't5_2qhwp' 't5_2qh33' 't5_2ror6'
 't5_2sgoq' 't5_2qm35' 't5_2qo4s' nan]


---
# **Part 2: Generate the word embeddings**

In this section, I have read the 15 subreddits. Generate word embedding for every commnent, and then take the mean for every user. This embedding represents his language usage on the reddit platform.

---
.

Import libraries

In [6]:
!pip install -U sentence-transformers
# load tqdm
!pip install --force https://github.com/chengs/tqdm/archive/colab.zip
!pip install swifter

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/86.0 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting transformers<5.0.0,>=4.6.0 (from sentence-transformers)
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.7/7.7 MB[0m [31m96.0 MB/s[0m eta [36m0:00:00[0m
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m82.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting huggingface-hub>=0.4.0 (from sentence-transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.0/302.0 k

Collecting swifter
  Downloading swifter-1.4.0.tar.gz (1.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting tqdm>=4.33.0 (from swifter)
  Downloading tqdm-4.66.1-py3-none-any.whl (78 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m78.3/78.3 kB[0m [31m11.0 MB/s[0m eta [36m0:00:00[0m
Building wheels for collected packages: swifter
  Building wheel for swifter (setup.py) ... [?25l[?25hdone
  Created wheel for swifter: filename=swifter-1.4.0-py3-none-any.whl size=16507 sha256=a9e93d46f2ce67bd5495b2c791d3a31685ee5deb53c29f09e577b6a9573f53f3
  Stored in directory: /root/.cache/pip/wheels/e4/cf/51/0904952972ee2c7aa3709437065278dc534ec1b8d2ad41b443
Successfully built swifter
Installing collected packages: tqdm, swifter
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.28.1
    Uninstalling tqdm-4.28.1:
     

In [7]:
import numpy as np
import pandas as pd
import nltk
from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample, losses, util
from torch.utils.data import DataLoader
from sentence_transformers import losses
import os
import swifter
from nltk.tokenize import sent_tokenize
import torch
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

print data

In [39]:
print(len(data_fifteen_subreddits))
data_fifteen_subreddits.head(3)

107352


Unnamed: 0,edited,id,parent_id,distinguished,created_utc,author_flair_text,author_flair_css_class,controversiality,subreddit_id,retrieved_on,link_id,author,score,gilded,stickied,body,subreddit
0,0,dbumnpz,t1_dbulzrw,,1483229000.0,,NYAN,0.0,t5_22i0,1485680000.0,t3_5lc6zb,captnkaposzta,2.0,0.0,False,Beileid? Kiwi Fernsehgarten Trinkspiele retten...,de
1,0,dbumnq0,t1_dbum9w2,,1483229000.0,,,0.0,t5_2r2jt,1485680000.0,t3_5lai4x,CampyJejuni,3.0,0.0,False,Wrong subreddit mate.,TwoXChromosomes
2,0,dbumnq1,t3_5lb9zs,,1483229000.0,,,0.0,t5_3deqz,1485680000.0,t3_5lb9zs,Luigimario280,7.0,0.0,False,Karma!,BikiniBottomTwitter


clean up the data

First take only the 'author' and 'body' columns

In [40]:
data_word_embeddings = data_fifteen_subreddits[['author','body']]
print(len(data_word_embeddings)) #length of data = 107352
print(len(pd.unique(data_word_embeddings['author']))) #number of authors = 39443
print(data_word_embeddings.columns) #only two columns

107352
39443
Index(['author', 'body'], dtype='object')


drop comments with missing body

In [41]:
data_word_embeddings = data_word_embeddings.dropna(subset=['body'])
print(len(data_word_embeddings))
data_word_embeddings.head(3)

107345


Unnamed: 0,author,body
0,captnkaposzta,Beileid? Kiwi Fernsehgarten Trinkspiele retten...
1,CampyJejuni,Wrong subreddit mate.
2,Luigimario280,Karma!


delete those comments which have body as '[deleted]'. We see that there are 5849 such rows.

In [42]:
data_word_embeddings[data_word_embeddings['body'] == '[deleted]']

Unnamed: 0,author,body
6,[deleted],[deleted]
19,[deleted],[deleted]
22,[deleted],[deleted]
41,[deleted],[deleted]
53,[deleted],[deleted]
...,...,...
107292,[deleted],[deleted]
107313,[deleted],[deleted]
107333,[deleted],[deleted]
107340,[deleted],[deleted]


In [43]:
data_word_embeddings = data_word_embeddings[data_word_embeddings['body'] != '[deleted]']
print(len(data_word_embeddings))
print(len(pd.unique(data_word_embeddings['author']))) #number of authors = 39440
print(data_word_embeddings.head(3))

101496
39440
          author                                               body
0  captnkaposzta  Beileid? Kiwi Fernsehgarten Trinkspiele retten...
1    CampyJejuni                              Wrong subreddit mate.
2  Luigimario280                                             Karma!


using a pretrained SBERT model to encode the sentences rather than training on the reddit data itself

In [37]:
from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')

Downloading (…)821d1/.gitattributes:   0%|          | 0.00/391 [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)8d01e821d1/README.md:   0%|          | 0.00/3.95k [00:00<?, ?B/s]

Downloading (…)d1/added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

Downloading (…)01e821d1/config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)821d1/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/399 [00:00<?, ?B/s]

Downloading (…)8d01e821d1/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)1e821d1/modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

Took around 20 min on a A100 gpu

In [44]:
embeddings_list = []
for ind, row in data_word_embeddings.iterrows():

  if ind % 5000 == 0:
    print('created embedding for '+str(ind)+'/'+str(len(data_word_embeddings))+' comments')

  curr_comment = row['body']

  sentence_embedding = sbert_model.encode(curr_comment)
  embeddings_list.append(sentence_embedding)

data_word_embeddings['word_embedding'] = embeddings_list

print(len(data_word_embeddings))
print(data_word_embeddings.head(3))

created embedding for 0/101496 comments
created embedding for 5000/101496 comments
created embedding for 10000/101496 comments
created embedding for 15000/101496 comments
created embedding for 20000/101496 comments
created embedding for 25000/101496 comments
created embedding for 30000/101496 comments
created embedding for 35000/101496 comments
created embedding for 40000/101496 comments
created embedding for 45000/101496 comments
created embedding for 50000/101496 comments
created embedding for 55000/101496 comments
created embedding for 60000/101496 comments
created embedding for 65000/101496 comments
created embedding for 70000/101496 comments
created embedding for 75000/101496 comments
created embedding for 80000/101496 comments
created embedding for 85000/101496 comments
created embedding for 90000/101496 comments
created embedding for 95000/101496 comments
created embedding for 105000/101496 comments
101496
          author                                               body  \
0 

note: the index above goes till 107 thousand, but there are only 101496 comments considered

In [46]:
#print the length of the first row embedding
print(data_word_embeddings['word_embedding'][0])
print(len(data_word_embeddings['word_embedding'][0])) #length is 768

[-1.15764081e-01  2.40050897e-01  1.36881459e+00  3.39362681e-01
  3.56814146e-01 -9.80027795e-01  3.46929073e-01 -1.79646850e-01
  1.24144502e-01  3.02083701e-01  4.69356962e-02  5.89588583e-01
  8.67583811e-01  7.54454255e-01 -6.22919917e-01  8.19891870e-01
 -5.97440243e-01 -3.38363834e-02  2.53817320e-01 -7.33135223e-01
 -7.11114824e-01  7.29612634e-02 -6.93723798e-01  5.34910738e-01
  3.64424020e-01  4.11940426e-01  7.09144890e-01 -9.83539581e-01
 -5.12035303e-02  3.19225162e-01  1.38481751e-01  3.30527201e-02
 -1.72940493e-01 -2.40447208e-01  5.21433987e-02  5.93059242e-01
 -3.85586530e-01 -2.62648202e-02  2.96835244e-01  2.36109659e-01
  3.00191343e-01 -2.52287298e-01  6.64876580e-01  9.97578949e-02
 -1.42884469e+00 -1.54249549e-01  8.51189271e-02  7.12770104e-01
  3.02345544e-01 -9.42264140e-01 -7.66531825e-02 -2.79888779e-01
  7.92343438e-01  4.16219443e-01 -2.00807959e-01 -1.14711642e-01
  3.99682075e-01 -8.23449969e-01  1.78925782e-01 -3.82849693e-01
 -6.73233390e-01 -1.08337

Save the word embeddings for every author in a csv

Aggregate based on authors to have a single embedding for every author

In [51]:
#number of unique authors
print(len(data_word_embeddings)) #length of data embeddings data
print(len(pd.unique(data_word_embeddings['author']))) #number of authors = 39440

101496
39440


In [52]:
author_embeddings = data_word_embeddings.groupby(['author'], as_index=False)['word_embedding'].mean()
print(len(author_embeddings)) #length of data embeddings data
print(len(pd.unique(author_embeddings['author']))) #number of authors = 39440

39440
39440


In [54]:
print(len(author_embeddings))
print(author_embeddings.head(3))

39440
                 author                                     word_embedding
0                --AJ--  [0.10850837, 0.59992176, 1.7410517, 0.06650150...
1       --Hello_World--  [-0.23539007, -0.056756258, 1.4918262, 0.14055...
2  --IIII--------IIII--  [0.34247348, 1.1148782, 0.71182245, -0.3969451...


In [55]:
author_embeddings.to_csv('/content/gdrive/MyDrive/Colab Notebooks/reddit_project/word_embeddings_fifteen_subreddits.csv')

---
# **Part 3: Find Cultural Similarity between user pairs**

For every user pair, find the cosine similarity between their word embeddings

---
.

In [67]:
cul_sim_results = pd.DataFrame(columns=['subreddit_id','id','from_user','to_user','cultural_similarity'])
print(cul_sim_results)

Empty DataFrame
Columns: [subreddit_id, id, from_user, to_user, cultural_similarity]
Index: []


function to find cosine similarity

In [62]:
import numpy as np
def cosine_sim(vector1, vector2):
    return min(1., np.dot(vector1, vector2) / (np.linalg.norm(vector1, ord=2) * np.linalg.norm(vector2, ord=2)))

In [68]:
def cultural_similarity_function(input_data):

  ignore_comments_counter = 0
  j = 0

  #additional code to resolve an error
  type_base = type(input_data['parent_id'].iloc[0])

  for ind, row in input_data.iterrows():
    j += 1
    if j % 10000 == 0:
      print('finished comment '+str(j)+'/'+str(len(input_data)))

    curr_author = row['author']
    curr_subreddit_id = row['subreddit_id']
    curr_id = row['id']
    if type(row['parent_id']) != type_base:
      ignore_comments_counter
      continue
    curr_parent_comment_id = row['parent_id'][3:] #noticed that the parent id is nothing but the comment id preceded by 3 characters

    #find the parent comment
    #identify if a comment is a primary comment
    primary_comment_flag = 0
    if row['parent_id'] == row['link_id']: #it is a primary comment
      #print("primary comment")
      primary_comment_flag = 1
    if(len(input_data[input_data['id'] == curr_parent_comment_id]['author']) == 0): #the parent comment could not be found
      ignore_comments_counter += 1
      continue
    primary_comment_flag = 0
    curr_parent = input_data[input_data['id'] == curr_parent_comment_id]['author'].values[0]

    #find the word embeddings of sender and reciver and hence cosine similarity
    if len(author_embeddings[author_embeddings['author'] == curr_author]['word_embedding'].values) == 0: #could not find the embedding of the sender
      ignore_comments_counter += 1
      continue
    else:
      from_user_embedding = author_embeddings[author_embeddings['author'] == curr_author]['word_embedding'].values[0]

    if len(author_embeddings[author_embeddings['author'] == curr_parent]['word_embedding'].values) == 0: #could not find the embedding of the reciver
      ignore_comments_counter += 1
      continue
    else:
      to_user_embedding = author_embeddings[author_embeddings['author'] == curr_parent]['word_embedding'].values[0]

    if j == 0:
      print(from_user_embedding.shape)
      print(to_user_embedding.shape)
    cultsim = cosine_sim(from_user_embedding, to_user_embedding)

    #dont add the user pair if it was already encountered
    if len(cul_sim_results[(cul_sim_results['from_user'] == curr_author) & (cul_sim_results['to_user'] == curr_parent)].values) > 0: #there exists a row with the curr_author to curr_parent already so we have already found the cosine similarity between embeddings
      ignore_comments_counter += 1
      continue
    else:
      cul_sim_results.loc[len(cul_sim_results.index)] = [curr_subreddit_id, curr_id, curr_author, curr_parent, cultsim]

  print('total number of comments ignored: ' +str(ignore_comments_counter))
  return cul_sim_results

data_cultural_similarity = cultural_similarity_function(data_fifteen_subreddits)

finished comment 10000/107352
finished comment 20000/107352
finished comment 30000/107352
finished comment 40000/107352
finished comment 50000/107352
finished comment 60000/107352
finished comment 70000/107352
finished comment 80000/107352
finished comment 90000/107352
finished comment 100000/107352
total number of comments ignored: 54022


In [69]:
print(len(data_cultural_similarity))
print(data_cultural_similarity.head(3))

53325
  subreddit_id       id  from_user         to_user  cultural_similarity
0     t5_2qo4s  dbumpbm   Exoguana   passiveparrot             0.767708
1     t5_2qo4s  dbumphm    Stankie        yungtito             0.770331
2     t5_2qh33  dbumq02  [deleted]  BrightenedGold             0.497467


Confirm that there are no repeated user pairs

In [70]:
data_cultural_similarity.duplicated(subset=['from_user', 'to_user']).sum()

0

In [71]:
data_cultural_similarity.to_csv('/content/gdrive/MyDrive/Colab Notebooks/reddit_project/data_cultural_similarity.csv')