---

### Additional testing

This notebook performs additional processing and checks on the askscience data.

The data file 'data_result_askscience_subreddits_aggressive_language.csv' is extended in the following two steps:


Part 1:   add the comment-level cultural similarity (which has not yet been averaged across user pairs)

Part 2:   add a new dummy column 'networkSimilarity_Dummy', which takes the value 1 for high network similarity comments and 0 for low network similarity comments.

---



## **Part 1:   add the comment level cultural similarity (which has not yet been averaged across user_pairs)**

---

Check whether CUDA is available

In [40]:
import torch
if torch.cuda.is_available():
    device_name = torch.device("cuda")
else:
    device_name = torch.device('cpu')
print("Using {}.".format(device_name))

Using cpu.


Mount Google Drive

In [41]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Read the output file of 'Aggressive_Language_askscience.ipynb', which is 'data_result_askscience_subreddits_aggressive_language.csv'.

In [42]:
import pandas as pd
import numpy as np
data = pd.read_csv("/content/gdrive/MyDrive/Colab Notebooks/askscience/data_result_askscience_subreddits_aggressive_language.csv", index_col=0)

In [43]:
print(len(pd.unique(data['subreddit']))) # number of subreddits considered = 1
print(len(pd.unique(data['id']))) # number of unique comments = 7952; the data is at the comment level
print(len(pd.unique(data['parent_id']))) # number of unique parent nodes = 4612
print(len(pd.unique(data['link_id']))) # number of submissions = 368
print(len(pd.unique(data['author']))) # number of unique authors = 5002
print(len(data.columns)) # number of columns = 23

1
7952
4612
368
5002
23
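
The same counts can also be obtained in a single call. The following is a minimal sketch rather than the notebook's own code: the frame `toy` is a hypothetical stand-in built from the first three rows of the dataset, and `nunique()` is the standard pandas method that returns the number of distinct values per column.

```python
import pandas as pd

# Hypothetical stand-in for `data`, built from the first three rows of the dataset.
toy = pd.DataFrame({
    "id": ["iqker6l", "iqkfl8j", "iqkfmj9"],
    "subreddit": ["askscience"] * 3,
    "author": ["omniskeptic", "jeweledjuniper", "chop1n"],
    "parent_id": ["iqkee0k", "iqke0xc", "iqker6l"],
    "link_id": ["xs73nx", "xs1k1y", "xs73nx"],
})

# One call returns the number of distinct values for each selected column.
unique_counts = toy[["subreddit", "id", "parent_id", "link_id", "author"]].nunique()
print(unique_counts)

# Number of columns in the frame.
print(toy.shape[1])
```

On the real data, the same call on `data` should reproduce the figures printed above, assuming the column names are unchanged.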


In [44]:
data.head(3)

Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,parent_id,link_id,retrieved_on,...,cultural_similarity,parent_comment_author,insult_prob,toxicity_prob,threat_prob,date_time,date,time,date_hour,date_hour_min
0,iqker6l,askscience,No it does not imply that. “We don’t yet know”...,omniskeptic,2,0,1664582942,iqkee0k,xs73nx,1664960533,...,0.318494,chop1n,0.000171,0.000725,0.000118,2022-10-01_00:09:02,2022-10-01,00:09:02,2022-10-01_00,2022-10-01_00:09
3,iqkfl8j,askscience,Pasteurization works by heating (generally a l...,jeweledjuniper,11,0,1664583360,iqke0xc,xs1k1y,1664960508,...,0.642043,feitingen,0.000198,0.000734,0.000142,2022-10-01_00:16:00,2022-10-01,00:16:00,2022-10-01_00,2022-10-01_00:16
4,iqkfmj9,askscience,"It *absolutely* implies an expectation, even i...",chop1n,3,0,1664583378,iqker6l,xs73nx,1664960507,...,0.421561,omniskeptic,0.000175,0.000864,0.000121,2022-10-01_00:16:18,2022-10-01,00:16:18,2022-10-01_00,2022-10-01_00:16


Now read the file 'data_askscience_comment_level_culsim.csv', generated by 'Word_Embeddings_For_askscience_Subreddits.ipynb'. It contains the cultural similarity for each comment, not averaged across user pairs, so the cultural similarity stays at the comment level.

In [45]:
culsim_commentLevel = pd.read_csv("/content/gdrive/MyDrive/Colab Notebooks/askscience/data_askscience_comment_level_culsim.csv", index_col=0)

In [46]:
print(len(pd.unique(culsim_commentLevel['id']))) # number of unique comments = 7952; the data is at the comment level
print(len(pd.unique(culsim_commentLevel['from_user']))) # number of unique comment authors (from_user) = 5002
print(len(pd.unique(culsim_commentLevel['to_user']))) # number of unique users replied to (to_user) = 3265
print(len(culsim_commentLevel.columns)) # number of columns = 4

7952
5002
3265
4


In [47]:
culsim_commentLevel.head(3)

Unnamed: 0,id,from_user,to_user,cultural_similarity
0,iqker6l,omniskeptic,chop1n,0.432118
1,iqkfl8j,jeweledjuniper,feitingen,0.642043
2,iqkfmj9,chop1n,omniskeptic,0.421561


Now map each of the 7952 comments in 'data_result_askscience_subreddits_aggressive_language.csv' to its comment-level cultural similarity.

In [48]:
def map_commentLevel_culSim(input_data, culturalSimilarity_commentlevel_data):
  """Add column 'culturalSimilarity_Com' holding the comment-level cultural similarity.

  Note: input_data is modified in place and also returned.
  """
  input_data['culturalSimilarity_Com'] = np.nan
  j = 0

  for ind, row in input_data.iterrows():

    # progress report every 1000 comments
    j += 1
    if j % 1000 == 0:
      print('finished comment '+str(j)+'/'+str(len(input_data)))

    cur_author = row['author']
    cur_parent = row['parent_comment_author']
    cur_comment_id = row['id']

    # select the row with the same comment id and speaker-receiver pair;
    # .values[0] raises an IndexError if no such row exists
    match = (culturalSimilarity_commentlevel_data['from_user'] == cur_author) & (culturalSimilarity_commentlevel_data['to_user'] == cur_parent) & (culturalSimilarity_commentlevel_data['id'] == cur_comment_id)
    culsim_comLevel = culturalSimilarity_commentlevel_data[match]['cultural_similarity'].values[0]
    input_data.at[ind,'culturalSimilarity_Com'] = culsim_comLevel

  return input_data
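
The row-by-row lookup above can also be written as a single join. The following is a hedged sketch of that alternative, not the code used to produce the results in this notebook: the frames `comments` and `culsim` are hypothetical stand-ins built from the preview rows, and the sketch assumes that each combination of `id`, speaker and receiver occurs at most once in both tables.

```python
import pandas as pd

# Hypothetical stand-ins for `data` and `culsim_commentLevel` (preview rows only).
comments = pd.DataFrame({
    "id": ["iqker6l", "iqkfl8j", "iqkfmj9"],
    "author": ["omniskeptic", "jeweledjuniper", "chop1n"],
    "parent_comment_author": ["chop1n", "feitingen", "omniskeptic"],
})
culsim = pd.DataFrame({
    "id": ["iqker6l", "iqkfl8j", "iqkfmj9"],
    "from_user": ["omniskeptic", "jeweledjuniper", "chop1n"],
    "to_user": ["chop1n", "feitingen", "omniskeptic"],
    "cultural_similarity": [0.432118, 0.642043, 0.421561],
})

# Align the column names of the lookup table with those of the comment table.
lookup = culsim.rename(columns={
    "from_user": "author",
    "to_user": "parent_comment_author",
    "cultural_similarity": "culturalSimilarity_Com",
})

# Left join keeps every comment; validate raises if a key is duplicated on either side.
merged = comments.merge(
    lookup,
    on=["id", "author", "parent_comment_author"],
    how="left",
    validate="one_to_one",
)
print(merged)
```

Unlike the loop, a left join leaves `NaN` for comments without a match instead of raising an error, and it returns a new frame with a fresh integer index rather than modifying the input in place.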

In [49]:
data1 = map_commentLevel_culSim(data,culsim_commentLevel)
print(len(data1)) # number of rows = 7952
print(len(pd.unique(data1['subreddit']))) # number of subreddits considered = 1
print(len(pd.unique(data1['id']))) # number of unique comments = 7952; the data is at the comment level
print(len(pd.unique(data1['author']))) # number of unique authors = 5002
print(len(pd.unique(data1['parent_id']))) # number of unique parent nodes = 4612
print(len(pd.unique(data1['link_id']))) # number of submissions = 368
print(len(data1[['author', 'parent_comment_author']].value_counts())) # number of unique speaker-receiver pairs = 7478
print(len(data1.columns)) # number of columns = 24

finished comment 1000/7952
finished comment 2000/7952
finished comment 3000/7952
finished comment 4000/7952
finished comment 5000/7952
finished comment 6000/7952
finished comment 7000/7952
7952
1
7952
5002
4612
368
7478
24


In [51]:
data1.head(10)

Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,parent_id,link_id,retrieved_on,...,parent_comment_author,insult_prob,toxicity_prob,threat_prob,date_time,date,time,date_hour,date_hour_min,culturalSimilarity_Com
0,iqker6l,askscience,No it does not imply that. “We don’t yet know”...,omniskeptic,2,0,1664582942,iqkee0k,xs73nx,1664960533,...,chop1n,0.000171,0.000725,0.000118,2022-10-01_00:09:02,2022-10-01,00:09:02,2022-10-01_00,2022-10-01_00:09,0.432118
3,iqkfl8j,askscience,Pasteurization works by heating (generally a l...,jeweledjuniper,11,0,1664583360,iqke0xc,xs1k1y,1664960508,...,feitingen,0.000198,0.000734,0.000142,2022-10-01_00:16:00,2022-10-01,00:16:00,2022-10-01_00,2022-10-01_00:16,0.642043
4,iqkfmj9,askscience,"It *absolutely* implies an expectation, even i...",chop1n,3,0,1664583378,iqker6l,xs73nx,1664960507,...,omniskeptic,0.000175,0.000864,0.000121,2022-10-01_00:16:18,2022-10-01,00:16:18,2022-10-01_00,2022-10-01_00:16,0.421561
38,iqkrd5j,askscience,Thats also what I remember. There was speculat...,greese007,2,0,1664589335,iqklvbl,xs4rhf,1664960145,...,blscratch,0.000195,0.000915,0.000128,2022-10-01_01:55:35,2022-10-01,01:55:35,2022-10-01_01,2022-10-01_01:55,0.20336
39,iqkre4b,askscience,"Not sure if you’re writing only about insects,...",viciousfishous08,34,0,1664589349,iqke7g3,xs9pjy,1664960143,...,thelogicalghost,0.000226,0.00161,0.000116,2022-10-01_01:55:49,2022-10-01,01:55:49,2022-10-01_01,2022-10-01_01:55,0.032065
52,iqkt0ym,askscience,"Ahaha, so, short version is, its a scifi story...",thelogicalghost,34,0,1664590185,iqkre4b,xs9pjy,1664960094,...,viciousfishous08,0.00018,0.000896,0.000108,2022-10-01_02:09:45,2022-10-01,02:09:45,2022-10-01_02,2022-10-01_02:09,0.236366
56,iqkww4r,askscience,Also your hearing recognizes the tones as cert...,tin_man6328,4,0,1664592247,iqkfsgy,xs73nx,1664959974,...,moewind420,0.000179,0.000659,0.000124,2022-10-01_02:44:07,2022-10-01,02:44:07,2022-10-01_02,2022-10-01_02:44,0.430613
67,iqkyg8o,askscience,This is why you feel blinded when youre drivin...,yeswehavenotomatoes,34,0,1664593112,iqke5ya,xs73nx,1664959927,...,balazer,0.001518,0.027269,0.0004,2022-10-01_02:58:32,2022-10-01,02:58:32,2022-10-01_02,2022-10-01_02:58,0.239771
76,iqkz2yw,askscience,this sounds very interesting! would love to se...,mib_sum1ls,13,0,1664593459,iqkt0ym,xs9pjy,1664959907,...,thelogicalghost,0.000253,0.001617,0.000111,2022-10-01_03:04:19,2022-10-01,03:04:19,2022-10-01_03,2022-10-01_03:04,0.310097
86,iql02a4,askscience,Give me multiverse version of Starship Trooper...,glomgore,4,0,1664594017,iqkz2yw,xs9pjy,1664959877,...,mib_sum1ls,0.000183,0.000962,0.000116,2022-10-01_03:13:37,2022-10-01,03:13:37,2022-10-01_03,2022-10-01_03:13,0.320632


Confirm that the date-time conversions were done correctly

In [53]:
print(len(pd.unique(data['date_time'])))
print(pd.unique(data['date_time']))
print(len(pd.unique(data['date'])))
print(pd.unique(data['date']))
print(len(pd.unique(data['time'])))
print(pd.unique(data['time']))
print(len(pd.unique(data['date_hour'])))
print(pd.unique(data['date_hour']))
print(len(pd.unique(data['date_hour_min'])))
print(pd.unique(data['date_hour_min']))

7927
['2022-10-01_00:09:02' '2022-10-01_00:16:00' '2022-10-01_00:16:18' ...
 '2022-10-31_23:25:52' '2022-10-31_23:50:29' '2022-10-31_23:42:20']
31
['2022-10-01' '2022-10-30' '2022-10-29' '2022-10-02' '2022-10-03'
 '2022-10-04' '2022-10-05' '2022-10-06' '2022-10-31' '2022-10-07'
 '2022-10-08' '2022-10-12' '2022-10-09' '2022-10-10' '2022-10-11'
 '2022-10-13' '2022-10-14' '2022-10-15' '2022-10-16' '2022-10-17'
 '2022-10-18' '2022-10-19' '2022-10-20' '2022-10-21' '2022-10-22'
 '2022-10-23' '2022-10-24' '2022-10-25' '2022-10-26' '2022-10-27'
 '2022-10-28']
7566
['00:09:02' '00:16:00' '00:16:18' ... '23:25:52' '23:50:29' '23:42:20']
720
['2022-10-01_00' '2022-10-01_01' '2022-10-01_02' '2022-10-01_03'
 '2022-10-01_04' '2022-10-01_05' '2022-10-01_07' '2022-10-01_08'
 '2022-10-01_09' '2022-10-01_10' '2022-10-01_11' '2022-10-01_12'
 '2022-10-30_15' '2022-10-01_14' '2022-10-01_15' '2022-10-29_18'
 '2022-10-01_16' '2022-10-01_17' '2022-10-01_18' '2022-10-01_19'
 '2022-10-01_20' '2022-10-01_21' '20
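
Printing the unique values only allows a visual check. A stricter check, sketched below, rebuilds the strings from `created_utc` and compares them with the stored columns. This is an illustration, not part of the original analysis: the frame `toy` is a hypothetical stand-in built from the preview rows, and the sketch assumes that `date_time` and `date_hour` were derived from `created_utc` in UTC, which is consistent with those rows.

```python
import pandas as pd

# Hypothetical stand-in for `data`, using the first three preview rows.
toy = pd.DataFrame({
    "created_utc": [1664582942, 1664583360, 1664583378],
    "date_time": ["2022-10-01_00:09:02", "2022-10-01_00:16:00", "2022-10-01_00:16:18"],
    "date_hour": ["2022-10-01_00", "2022-10-01_00", "2022-10-01_00"],
})

# Interpret the epoch seconds as UTC timestamps (assumption: the columns are in UTC).
rebuilt = pd.to_datetime(toy["created_utc"], unit="s")

# Re-create both string formats and compare them element-wise with the stored columns.
date_time_ok = rebuilt.dt.strftime("%Y-%m-%d_%H:%M:%S") == toy["date_time"]
date_hour_ok = rebuilt.dt.strftime("%Y-%m-%d_%H") == toy["date_hour"]

print(date_time_ok.all(), date_hour_ok.all())
```

If either comparison is not all `True` on the real data, the rows where the mask is `False` identify the comments whose conversion should be inspected.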

Check that every comment has a 'culturalSimilarity_Com' value

In [54]:
print(data1['culturalSimilarity_Com'].isna().sum())

0


---


## **Part 2:   add a new dummy column 'networkSimilarity_Dummy', which takes the value 1 for high network similarity comments and 0 for low network similarity comments.**

---

In [55]:
data1.describe()

Unnamed: 0,score,gilded,created_utc,retrieved_on,controversiality,network_similarity,cultural_similarity,insult_prob,toxicity_prob,threat_prob,culturalSimilarity_Com
count,7952.0,7952.0,7952.0,7952.0,7952.0,7952.0,7952.0,7952.0,7952.0,7952.0,7952.0
mean,20.791122,0.001258,1665899000.0,1667917000.0,0.022887,0.921399,0.469277,0.003137,0.019403,0.000793,0.469277
std,117.879782,0.035442,753758.4,244160.0,0.149554,0.106205,0.193026,0.030717,0.093551,0.013706,0.198451
min,-183.0,0.0,1664583000.0,1664958000.0,0.0,-0.221962,-0.139406,0.000164,0.000498,8.3e-05,-0.139406
25%,1.0,0.0,1665248000.0,1667898000.0,0.0,0.9028,0.349343,0.000179,0.000658,0.000117,0.346653
50%,3.0,0.0,1665880000.0,1667939000.0,0.0,0.958642,0.490906,0.000185,0.000825,0.000124,0.493054
75%,12.0,0.0,1666575000.0,1667976000.0,0.0,0.981896,0.608614,0.000231,0.001949,0.000135,0.613215
max,7274.0,1.0,1667260000.0,1668012000.0,1.0,1.0,0.968989,0.843692,0.996931,0.619349,0.968989


The median of 'network_similarity' is 0.958642 (the 50% row above). It is used below as the threshold between low and high network similarity.

In [61]:
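def map_comment_networkDummy(input_data):
  """Add column 'networkSimilarity_Dummy': 1 if network_similarity >= median, else 0.

  Note: input_data is modified in place and also returned.
  """
  input_data['networkSimilarity_Dummy'] = np.nan
  j = 0

  for ind, row in input_data.iterrows():

    # progress report every 1000 comments
    j += 1
    if j % 1000 == 0:
      print('finished comment '+str(j)+'/'+str(len(input_data)))

    # 0.958642 is the median of network_similarity taken from data1.describe()
    cur_network_similarity = row['network_similarity']
    if cur_network_similarity < 0.958642:
      val = 0
    else:
      val = 1
    input_data.at[ind,'networkSimilarity_Dummy'] = val

  return input_data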
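
The median split can also be computed without a loop and without a hard-coded threshold. The sketch below is an alternative formulation, not the code that produced the results in this notebook: the frame `toy` and its six values are made up purely for illustration, and the threshold is computed from the column instead of being copied from the rounded `describe()` output.

```python
import numpy as np
import pandas as pd

# Made-up illustrative values; not taken from the askscience data.
toy = pd.DataFrame({"network_similarity": [0.99, 0.90, 0.97, 0.50, 0.96, 0.98]})

# Compute the threshold from the data rather than hard-coding a rounded value.
threshold = toy["network_similarity"].median()

# 1.0 for values at or above the median, 0.0 below it (same rule as the loop above).
toy["networkSimilarity_Dummy"] = np.where(
    toy["network_similarity"] >= threshold, 1.0, 0.0
)

print(threshold)
print(toy)
```

Computing the median directly avoids any dependence on the six-decimal rounding shown in the summary table, which could matter for rows whose similarity lies very close to the threshold.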

In [62]:
data2 = map_comment_networkDummy(data1)
print(len(data2)) # number of rows = 7952
print(len(pd.unique(data2['subreddit']))) # number of subreddits considered = 1
print(len(pd.unique(data2['id']))) # number of unique comments = 7952; the data is at the comment level
print(len(pd.unique(data2['author']))) # number of unique authors = 5002
print(len(pd.unique(data2['parent_id']))) # number of unique parent nodes = 4612
print(len(pd.unique(data2['link_id']))) # number of submissions = 368
print(len(data2[['author', 'parent_comment_author']].value_counts())) # number of unique speaker-receiver pairs = 7478
print(len(data2.columns)) # number of columns = 25

finished comment 1000/7952
finished comment 2000/7952
finished comment 3000/7952
finished comment 4000/7952
finished comment 5000/7952
finished comment 6000/7952
finished comment 7000/7952
7952
1
7952
5002
4612
368
7478
25


In [63]:
data2.head(10)

Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,parent_id,link_id,retrieved_on,...,insult_prob,toxicity_prob,threat_prob,date_time,date,time,date_hour,date_hour_min,culturalSimilarity_Com,networkSimilarity_Dummy
0,iqker6l,askscience,No it does not imply that. “We don’t yet know”...,omniskeptic,2,0,1664582942,iqkee0k,xs73nx,1664960533,...,0.000171,0.000725,0.000118,2022-10-01_00:09:02,2022-10-01,00:09:02,2022-10-01_00,2022-10-01_00:09,0.432118,1.0
3,iqkfl8j,askscience,Pasteurization works by heating (generally a l...,jeweledjuniper,11,0,1664583360,iqke0xc,xs1k1y,1664960508,...,0.000198,0.000734,0.000142,2022-10-01_00:16:00,2022-10-01,00:16:00,2022-10-01_00,2022-10-01_00:16,0.642043,0.0
4,iqkfmj9,askscience,"It *absolutely* implies an expectation, even i...",chop1n,3,0,1664583378,iqker6l,xs73nx,1664960507,...,0.000175,0.000864,0.000121,2022-10-01_00:16:18,2022-10-01,00:16:18,2022-10-01_00,2022-10-01_00:16,0.421561,1.0
38,iqkrd5j,askscience,Thats also what I remember. There was speculat...,greese007,2,0,1664589335,iqklvbl,xs4rhf,1664960145,...,0.000195,0.000915,0.000128,2022-10-01_01:55:35,2022-10-01,01:55:35,2022-10-01_01,2022-10-01_01:55,0.20336,0.0
39,iqkre4b,askscience,"Not sure if you’re writing only about insects,...",viciousfishous08,34,0,1664589349,iqke7g3,xs9pjy,1664960143,...,0.000226,0.00161,0.000116,2022-10-01_01:55:49,2022-10-01,01:55:49,2022-10-01_01,2022-10-01_01:55,0.032065,1.0
52,iqkt0ym,askscience,"Ahaha, so, short version is, its a scifi story...",thelogicalghost,34,0,1664590185,iqkre4b,xs9pjy,1664960094,...,0.00018,0.000896,0.000108,2022-10-01_02:09:45,2022-10-01,02:09:45,2022-10-01_02,2022-10-01_02:09,0.236366,1.0
56,iqkww4r,askscience,Also your hearing recognizes the tones as cert...,tin_man6328,4,0,1664592247,iqkfsgy,xs73nx,1664959974,...,0.000179,0.000659,0.000124,2022-10-01_02:44:07,2022-10-01,02:44:07,2022-10-01_02,2022-10-01_02:44,0.430613,0.0
67,iqkyg8o,askscience,This is why you feel blinded when youre drivin...,yeswehavenotomatoes,34,0,1664593112,iqke5ya,xs73nx,1664959927,...,0.001518,0.027269,0.0004,2022-10-01_02:58:32,2022-10-01,02:58:32,2022-10-01_02,2022-10-01_02:58,0.239771,1.0
76,iqkz2yw,askscience,this sounds very interesting! would love to se...,mib_sum1ls,13,0,1664593459,iqkt0ym,xs9pjy,1664959907,...,0.000253,0.001617,0.000111,2022-10-01_03:04:19,2022-10-01,03:04:19,2022-10-01_03,2022-10-01_03:04,0.310097,1.0
86,iql02a4,askscience,Give me multiverse version of Starship Trooper...,glomgore,4,0,1664594017,iqkz2yw,xs9pjy,1664959877,...,0.000183,0.000962,0.000116,2022-10-01_03:13:37,2022-10-01,03:13:37,2022-10-01_03,2022-10-01_03:13,0.320632,0.0


Check that there are no missing values in the new column

In [65]:
print(data2['networkSimilarity_Dummy'].isna().sum())

0


Check how many rows have the value 1. A median split should assign roughly half of the 7952 comments to each group.

In [66]:
print(data2['networkSimilarity_Dummy'].sum())

3976.0


In [67]:
data2.to_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_askscience_regressions.csv')