
---
# Mapping Every Comment to a Network and Cultural Similarity Measure

This Python notebook maps each comment to the cosine similarity between the node embeddings of the comment's sender and receiver. The similarity values for the unique sender-receiver pairs come from the file 'data_fifteen_subreddits_method2c_node2vec', an output of the 'Node_Embeddings_For_Fifteen_Subreddits.ipynb' notebook. The notebook also maps each comment to the author of its parent comment. As a result, each comment is mapped to a network similarity, a cultural similarity, and a parent comment author.

OUTPUT FILES:<br>
1. 'similarity_fifteen_subreddits.csv': contains all the comments of the 15 subreddits; only some of them have a network similarity and a cultural similarity measure
2. 'data_fifteen_subreddits_similarity.csv': contains only the comments that have both a network similarity measure and a cultural similarity measure

---


In [None]:
import torch
if torch.cuda.is_available():
    device_name = torch.device("cuda")
else:
    device_name = torch.device('cpu')
print("Using {}.".format(device_name))

Using cuda.


In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


---
##### Output files of 'Node_Embeddings_For_Fifteen_Subreddits.ipynb' for other methods (2a and 2b)
---

In [None]:
import pandas as pd
data_fifteen_subreddits_method2a_node2vec = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/reddit_project/data_fifteen_subreddits_method2a_node2vec.csv', low_memory=False)
print(len(data_fifteen_subreddits_method2a_node2vec))
print(len(pd.unique(data_fifteen_subreddits_method2a_node2vec['from_user'])))
print(len(pd.unique(data_fifteen_subreddits_method2a_node2vec['to_user'])))
print(len(data_fifteen_subreddits_method2a_node2vec.columns))
print(data_fifteen_subreddits_method2a_node2vec.head(3))

54469
24280
18536
5
  submission_id     from_user          to_user  edgeweight_method2a  \
0     t3_4y2tcs   Harden-Soul          cabbeer                  1.0   
1     t3_4y779v  General_Fear  the_strasburger                  1.0   
2     t3_4ys9yd        ashaw7    victor_knight                  1.0   

   cosine_similarity  
0           0.824062  
1           0.931396  
2           0.796708  


In [None]:
import pandas as pd
data_fifteen_subreddits_method2b_node2vec = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/reddit_project/data_fifteen_subreddits_method2b_node2vec.csv', low_memory=False)
print(len(data_fifteen_subreddits_method2b_node2vec))
print(len(pd.unique(data_fifteen_subreddits_method2b_node2vec['from_user'])))
print(len(pd.unique(data_fifteen_subreddits_method2b_node2vec['to_user'])))
print(len(data_fifteen_subreddits_method2b_node2vec.columns))
print(data_fifteen_subreddits_method2b_node2vec.head(3))

53326
24280
18536
5
  subreddit_id  from_user        to_user  edgeweight_method2b  \
0      t5_22i0  -Calidro-        niedrig             1.000000   
1      t5_22i0       -KR-         beerde             1.000000   
2      t5_22i0       -to-  boilersuthere             0.333333   

   cosine_similarity  
0           0.980658  
1           0.939030  
2           0.978517  


---
# Map each comment to a network similarity and the cultural similarity<br>
###### (network similarity = the cosine similarity between node embeddings of the parent node and the sender node of this comment)<br>
###### (cultural similarity = the average cosine similarity between node embeddings of the parent node and the sender node of this comment)

---
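
As a reminder of what the similarity values represent, here is a minimal sketch of cosine similarity between two embedding vectors. The function name `cosine_similarity` and the 3-dimensional vectors are made up for illustration; the actual node embeddings come from the earlier notebook and are not reproduced here.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return float("nan")  # undefined when either vector is all zeros
    return float(np.dot(u, v) / denom)

# Hypothetical embeddings for a sender and two possible parent authors.
sender_embedding = np.array([1.0, 2.0, 2.0])
parallel_embedding = np.array([2.0, 4.0, 4.0])     # same direction as the sender
orthogonal_embedding = np.array([2.0, -1.0, 0.0])  # perpendicular to the sender

sim_parallel = cosine_similarity(sender_embedding, parallel_embedding)
sim_orthogonal = cosine_similarity(sender_embedding, orthogonal_embedding)
print(sim_parallel, sim_orthogonal)
```

Vectors pointing in the same direction score 1 and perpendicular vectors score 0, regardless of their lengths.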

Read the output file that holds the network similarity for unique user pairs.

In [None]:
import pandas as pd
import numpy as np
data_fifteen_subreddits_method2c_node2vec = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/reddit_project/data_fifteen_subreddits_method2c_node2vec.csv', low_memory=False)
print(len(data_fifteen_subreddits_method2c_node2vec))
print(len(pd.unique(data_fifteen_subreddits_method2c_node2vec['from_user'])))
print(len(pd.unique(data_fifteen_subreddits_method2c_node2vec['to_user'])))
print(len(data_fifteen_subreddits_method2c_node2vec.columns))
print(data_fifteen_subreddits_method2c_node2vec.head(3))

53326
24280
18536
4
  from_user         to_user  edgeweight_method2c  cosine_similarity
0    --AJ--          --AJ--                 1.00           1.000000
1    --AJ--      Switch72nd                 0.25           0.984488
2   --Nylon  tastefulchrist                 1.00           0.911960


##### note: The column 'cosine_similarity' is actually the 'Network Similarity'

Read the output file that holds the cultural similarity for unique user pairs.

In [None]:
data_fifteen_subreddits_user_pair_level_culsim = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/reddit_project/data_fifteen_subreddits_user_pair_level_culsim.csv', low_memory=False, index_col=0)
print(len(data_fifteen_subreddits_user_pair_level_culsim))
print(len(pd.unique(data_fifteen_subreddits_user_pair_level_culsim['from_user'])))
print(len(pd.unique(data_fifteen_subreddits_user_pair_level_culsim['to_user'])))
print(len(data_fifteen_subreddits_user_pair_level_culsim.columns))
print(data_fifteen_subreddits_user_pair_level_culsim.head(3))

49972
23718
18089
3
  from_user         to_user  cultural_similarity
0    --AJ--          --AJ--             0.150673
1    --AJ--      Switch72nd             0.505629
2   --Nylon  tastefulchrist             0.134080


The difference in the number of user pairs between the two output files (53326 vs. 49972) arises because comments with a missing body, or a body containing only the text '[deleted]', were dropped when calculating cultural similarity.

Thus 53326 unique user pairs were found in the data of 15 subreddits. <br>
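
To see which user pairs have a network similarity but no cultural similarity, one option is an anti-join with `merge(..., indicator=True)`. The sketch below uses small made-up tables (`network_pairs`, `cultural_pairs`) in place of the two files read above; only the column names are taken from this notebook.

```python
import pandas as pd

# Toy stand-ins for the two user-pair tables (values are made up).
network_pairs = pd.DataFrame({
    'from_user': ['a', 'a', 'b'],
    'to_user': ['b', 'c', 'c'],
    'cosine_similarity': [0.9, 0.8, 0.7],
})
cultural_pairs = pd.DataFrame({
    'from_user': ['a', 'b'],
    'to_user': ['b', 'c'],
    'cultural_similarity': [0.5, 0.1],
})

# Left-join on the pair; '_merge' records where each row was found.
merged = network_pairs.merge(
    cultural_pairs[['from_user', 'to_user']],
    on=['from_user', 'to_user'],
    how='left',
    indicator=True,
)
# Rows marked 'left_only' exist in the network table but not the cultural one.
missing_pairs = merged[merged['_merge'] == 'left_only']
print(len(missing_pairs))
```

Applied to the real tables, the length of `missing_pairs` should match the gap between the two pair counts if the cultural pairs are a subset of the network pairs, which is an assumption not checked here.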

Function to map each comment to a network similarity and a cultural similarity (if a valid parent comment is found)

In [None]:
def similarity_mapper(input_data):

  input_data['network_similarity'] = np.nan
  input_data['cultural_similarity'] = np.nan
  input_data['parent_comment_author'] = None #object column, since author names (strings) are written into it below

  ignore_comments_counter = 0
  j = 0

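  #parent_id should be a string; take the type of the first row as reference
  #so that rows with a missing parent_id (e.g. NaN) can be skipped below
  type_base = type(input_data['parent_id'].iloc[0])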

  for ind, row in input_data.iterrows():
    j += 1
    if j % 10000 == 0:
      print('finished comment '+str(j)+'/'+str(len(input_data)))

    curr_author = row['author']
    curr_subreddit_id = row['subreddit_id']
    curr_id = row['id']
    if type(row['parent_id']) != type_base:
      #parent_id is missing or not a string: skip this comment
      #(these skips are not added to ignore_comments_counter)
      continue
    curr_parent_comment_id = row['parent_id'][3:] #the parent id is the parent's comment id preceded by a 3-character prefix (e.g. 't1_')

    #find the parent comment
    #a primary (top-level) comment has parent_id == link_id: its parent is the submission,
    #which is normally not in the comment data, so it ends up in the "not found" branch below
    parent_rows = input_data[input_data['id'] == curr_parent_comment_id]['author']
    if len(parent_rows) == 0: #the parent comment could not be found
      ignore_comments_counter += 1
      continue
    #if the parent was found
    curr_parent = parent_rows.values[0]

    #add the parent
    input_data.at[ind,'parent_comment_author'] = curr_parent

    #find the network similarity for this comment from the user pairs data
    net_pairs = data_fifteen_subreddits_method2c_node2vec
    net_match = net_pairs[(net_pairs['from_user'] == curr_author) & (net_pairs['to_user'] == curr_parent)]['cosine_similarity'].values
    if len(net_match) != 0:
      input_data.at[ind,'network_similarity'] = net_match[0]
    else:
      continue #could not find the network similarity

    #find the cultural similarity for this comment from the user pairs data
    cul_pairs = data_fifteen_subreddits_user_pair_level_culsim
    cul_match = cul_pairs[(cul_pairs['from_user'] == curr_author) & (cul_pairs['to_user'] == curr_parent)]['cultural_similarity'].values
    if len(cul_match) != 0:
      input_data.at[ind,'cultural_similarity'] = cul_match[0]
    else:
      continue #could not find the cultural similarity


  print('total number of comments ignored: ' +str(ignore_comments_counter))
  return input_data
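
The loop above filters the full table once per comment, which is slow on roughly 100k rows. A merge-based variant is sketched below. It is not the method used to produce the outputs in this notebook: `similarity_mapper_merge` and the toy tables are hypothetical, and edge cases (such as missing author names) may not match the loop exactly.

```python
import pandas as pd

def similarity_mapper_merge(comments, network_pairs, cultural_pairs):
    """Merge-based sketch; returns a new frame with a fresh RangeIndex."""
    keys = ['author', 'parent_comment_author']
    out = comments.copy()

    # Lookup from comment id to author (first row per id, missing ids dropped).
    id_to_author = (comments.dropna(subset=['id'])
                            .drop_duplicates('id')
                            .set_index('id')['author'])
    # Strip the 3-character prefix from parent_id, then look up the parent's author.
    out['parent_comment_author'] = out['parent_id'].str[3:].map(id_to_author)

    net = (network_pairs.drop_duplicates(['from_user', 'to_user'])
           .rename(columns={'from_user': 'author', 'to_user': 'parent_comment_author',
                            'cosine_similarity': 'network_similarity'}))
    cul = (cultural_pairs.drop_duplicates(['from_user', 'to_user'])
           .rename(columns={'from_user': 'author', 'to_user': 'parent_comment_author'}))

    out = out.merge(net[keys + ['network_similarity']], on=keys, how='left')
    out = out.merge(cul[keys + ['cultural_similarity']], on=keys, how='left')
    # Mirror the loop: cultural similarity is only kept when a network similarity exists.
    out.loc[out['network_similarity'].isna(), 'cultural_similarity'] = float('nan')
    return out

# Toy data (made up): c1 is a top-level comment, c4 replies to an unknown comment.
comments = pd.DataFrame({
    'id': ['c1', 'c2', 'c3', 'c4'],
    'parent_id': ['t3_s1', 't1_c1', 't1_c2', 't1_zz'],
    'author': ['alice', 'bob', 'carol', 'dave'],
})
network_pairs = pd.DataFrame({'from_user': ['bob', 'carol'], 'to_user': ['alice', 'bob'],
                              'cosine_similarity': [0.9, 0.8]})
cultural_pairs = pd.DataFrame({'from_user': ['bob'], 'to_user': ['alice'],
                               'cultural_similarity': [0.4]})

mapped = similarity_mapper_merge(comments, network_pairs, cultural_pairs)
print(mapped)
```

In the toy result, only bob's reply to alice gets both similarity values; carol's reply to bob gets a network similarity only, and the other two comments have no parent author.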


Read the 15 subreddits data and call the function to find the network similarity, cultural similarity, and parent author for each comment.

In [None]:
data_fifteen_subreddits = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/reddit_project/data_fifteen_subreddits.csv', low_memory=False)
print(len(data_fifteen_subreddits)) #length of data = 107352
print(len(pd.unique(data_fifteen_subreddits['subreddit_id']))) #prints 16, although only 15 subreddits are considered
print(len(pd.unique(data_fifteen_subreddits['id']))) #unique number of comments = 107351, the data is at the comment level
print(len(pd.unique(data_fifteen_subreddits['parent_id']))) #number of parent nodes = 50325
print(len(pd.unique(data_fifteen_subreddits['link_id']))) #number of submissions = 6156
print(len(data_fifteen_subreddits.columns)) #number of columns = 17

107352
16
107351
50325
6156
17
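
One possible reason the subreddit count prints 16 rather than 15 is a row with a missing 'subreddit_id': `pd.unique` counts NaN as a distinct value, while `Series.nunique()` skips it by default. This explanation is an assumption and was not verified on the actual data; the toy column below only shows the difference between the two calls.

```python
import numpy as np
import pandas as pd

# Toy column with two real subreddit ids and one missing value.
subreddit_ids = pd.Series(['t5_22i0', 't5_2r2jt', np.nan, 't5_22i0'])

count_with_nan = len(pd.unique(subreddit_ids))  # NaN is counted as a value
count_without_nan = subreddit_ids.nunique()     # NaN is excluded by default
print(count_with_nan, count_without_nan)
```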


In [None]:
similarity_fifteen_subreddits = similarity_mapper(data_fifteen_subreddits)
print(len(similarity_fifteen_subreddits)) #length of data = 107352
print(len(pd.unique(similarity_fifteen_subreddits['subreddit_id']))) #prints 16, although only 15 subreddits are considered
print(len(pd.unique(similarity_fifteen_subreddits['id']))) #unique number of comments = 107351, the data is at the comment level
print(len(pd.unique(similarity_fifteen_subreddits['parent_id']))) #number of parent nodes = 50325
print(len(pd.unique(similarity_fifteen_subreddits['link_id']))) #number of submissions = 6156
print(len(similarity_fifteen_subreddits.columns)) #number of columns = 20 (3 new columns added)

finished comment 10000/107352
finished comment 20000/107352
finished comment 30000/107352
finished comment 40000/107352
finished comment 50000/107352
finished comment 60000/107352
finished comment 70000/107352
finished comment 80000/107352
finished comment 90000/107352
finished comment 100000/107352
total number of comments ignored: 44313
107352
16
107351
50325
6156
20


Number of unique (author, parent comment author) pairs in the resulting set

In [None]:
print(len(similarity_fifteen_subreddits[['author', 'parent_comment_author']].value_counts()))

53326


In [None]:
similarity_fifteen_subreddits.head(3)

Unnamed: 0,edited,id,parent_id,distinguished,created_utc,author_flair_text,author_flair_css_class,controversiality,subreddit_id,retrieved_on,link_id,author,score,gilded,stickied,body,subreddit,network_similarity,cultural_similarity,parent_comment_author
0,0,dbumnpz,t1_dbulzrw,,1483229000.0,,NYAN,0.0,t5_22i0,1485680000.0,t3_5lc6zb,captnkaposzta,2.0,0.0,False,Beileid? Kiwi Fernsehgarten Trinkspiele retten...,de,,,
1,0,dbumnq0,t1_dbum9w2,,1483229000.0,,,0.0,t5_2r2jt,1485680000.0,t3_5lai4x,CampyJejuni,3.0,0.0,False,Wrong subreddit mate.,TwoXChromosomes,,,
2,0,dbumnq1,t3_5lb9zs,,1483229000.0,,,0.0,t5_3deqz,1485680000.0,t3_5lb9zs,Luigimario280,7.0,0.0,False,Karma!,BikiniBottomTwitter,,,


In [None]:
print(similarity_fifteen_subreddits['network_similarity'].isna().sum())
print(similarity_fifteen_subreddits['cultural_similarity'].isna().sum()) #empty for more records than network similarity
print(similarity_fifteen_subreddits['parent_comment_author'].isna().sum())

44318
48478
44318


44318: the number of records for which a network similarity was not found<br>
48478: the number of records for which a cultural similarity was not found

Save 'similarity_fifteen_subreddits' to a CSV file. It holds all the comments in the data set, but only some of them have network similarity and cultural similarity measures.

In [None]:
similarity_fifteen_subreddits.to_csv('/content/gdrive/MyDrive/Colab Notebooks/reddit_project/similarity_fifteen_subreddits.csv')

In [6]:
import pandas as pd
similarity_fifteen_subreddits = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/reddit_project/similarity_fifteen_subreddits.csv', low_memory=False)
print(len(similarity_fifteen_subreddits))

107352
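
The table had 20 columns before saving, while the reloaded table reports 21 further down. A likely cause is that `to_csv` writes the index as an extra unnamed column by default. The sketch below shows the effect on a made-up two-row frame, using an in-memory buffer instead of a file on Drive.

```python
import io
import pandas as pd

frame = pd.DataFrame({'id': ['c1', 'c2'], 'network_similarity': [0.9, 0.8]})

# Default to_csv writes the index as an unnamed first column.
with_index = pd.read_csv(io.StringIO(frame.to_csv()))
# index=False leaves the index out, so the columns round-trip unchanged.
without_index = pd.read_csv(io.StringIO(frame.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Passing `index=False` when saving, or `index_col=0` when reading, avoids the extra column.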


In [7]:
data_fifteen_subreddits_similarity = similarity_fifteen_subreddits[~similarity_fifteen_subreddits['network_similarity'].isna()]
print(len(data_fifteen_subreddits_similarity))

63034


In [8]:
data_fifteen_subreddits_similarity = data_fifteen_subreddits_similarity[~data_fifteen_subreddits_similarity['cultural_similarity'].isna()]
print(len(data_fifteen_subreddits_similarity))

58874


Now check how many unique user-parent pairs there are.

In [9]:
print(len(data_fifteen_subreddits_similarity[['author', 'parent_comment_author']].value_counts()))

49972


That means there are 58874 - 49972 = 8902 repeated user-parent pairs.
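
The same subtraction can be checked directly with `duplicated`, which flags every row whose pair already appeared earlier. The toy frame below is made up; only the column names follow this notebook.

```python
import pandas as pd

pairs = pd.DataFrame({
    'author': ['bob', 'bob', 'carol', 'bob'],
    'parent_comment_author': ['alice', 'alice', 'bob', 'alice'],
})
keys = ['author', 'parent_comment_author']

n_comments = len(pairs)
n_unique_pairs = len(pairs.drop_duplicates(keys))
# duplicated() marks the second and later occurrences of each pair.
n_repeats = int(pairs.duplicated(keys).sum())
print(n_comments, n_unique_pairs, n_repeats)
```

Here `n_comments - n_unique_pairs` equals `n_repeats`, the same relation as 58874 - 49972 = 8902 above.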

In [10]:
print(len(data_fifteen_subreddits_similarity)) #length of data = 58874
print(len(pd.unique(data_fifteen_subreddits_similarity['subreddit_id']))) #number of subreddits = 15
print(len(pd.unique(data_fifteen_subreddits_similarity['id']))) #unique number of comments = 58874, the data is at the comment level
print(len(pd.unique(data_fifteen_subreddits_similarity['parent_id']))) #number of parent nodes = 38858
print(len(pd.unique(data_fifteen_subreddits_similarity['link_id']))) #number of submissions = 3017
print(len(data_fifteen_subreddits_similarity.groupby(['author', 'parent_comment_author']).size().reset_index(name='Freq'))) #number of unique user pairs = 49972 (58874 - 49972 = 8902 repeats)
print(len(data_fifteen_subreddits_similarity.columns)) #number of columns = 21
print(data_fifteen_subreddits_similarity.head(5))

58874
15
58874
38858
3017
49972
21
     Unnamed: 0 edited       id   parent_id distinguished   created_utc  \
64           64      0  dbumpbm  t1_dbumnvd           NaN  1.483229e+09   
70           70      0  dbumphm  t1_dbumo04           NaN  1.483229e+09   
96           96      0  dbumqp3  t1_dbumnzf           NaN  1.483229e+09   
98           98      0  dbumqt2  t1_dbumpac           NaN  1.483229e+09   
121         121      0  dbumri2  t1_dbumojs           NaN  1.483229e+09   

                      author_flair_text author_flair_css_class  \
64                              Wizards               Wizards3   
70                                Kings                 Kings1   
96   Er ist bisschen ein Otto geworden.                   SHOL   
98                                 Heat                   Heat   
121                                 NaN                    NaN   

     controversiality subreddit_id  ...    link_id        author score  \
64                0.0     t5_2qo4s  ...  t3

Confirm that these columns do not have any missing values

In [11]:
print(data_fifteen_subreddits_similarity['network_similarity'].isna().sum())
print(data_fifteen_subreddits_similarity['cultural_similarity'].isna().sum())
print(data_fifteen_subreddits_similarity['parent_comment_author'].isna().sum())
print(data_fifteen_subreddits_similarity['body'].isna().sum())

0
0
0
0


In [12]:
data_fifteen_subreddits_similarity.to_csv('/content/gdrive/MyDrive/Colab Notebooks/reddit_project/data_fifteen_subreddits_similarity.csv')