---

### Additional testing

This notebook performs additional processing and checks on the askscience data.<br>

The data file 'data_result_askscience_subreddits_aggressive_language.csv' is extended in the following two steps:<br>


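Part 1:   add the comment-level cultural similarity (which has not yet been averaged across user pairs)<br>
Part 2:   add two dummy columns, 'networkSimilarity_DummyMed' and 'networkSimilarity_Dummy25', which take the value 1 for high network similarity comments and 0 for low network similarity comments, using the median and the 25th percentile of network similarity as the respective cutoffs.<br>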

output files:
1. 'data_askscience_regressions.csv': the file ready for regressions. This file contains the comments for which all of the following are available:


*   parent comment author
*   network similarity (this is averaged across user pairs)
*   cultural similarity (this is averaged across user pairs)
*   insult score
*   threat score
*   toxicity score
*   time stamp (date_time, date, time, date_hour, data_hour_min)
*   cultural similarity at the comment level
*   network similarity dummies (median and 25th percentile cutoffs)





---



## **Part 1:   add the comment level cultural similarity (which has not yet been averaged across user_pairs)**

---

Check if cuda is being used

In [1]:
import torch
if torch.cuda.is_available():
    device_name = torch.device("cuda")
else:
    device_name = torch.device('cpu')
print("Using {}.".format(device_name))

Using cpu.


Connect to drive

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


Read the output file from 'Aggressive_Language_askscience.ipynb', which is 'data_result_askscience_subreddits_aggressive_language.csv'.

In [3]:
import pandas as pd
import numpy as np
data = pd.read_csv("/content/gdrive/MyDrive/Colab Notebooks/askscience/data_result_askscience_subreddits_aggressive_language.csv", index_col=0)

In [4]:
print(len(pd.unique(data['subreddit']))) # number of subreddits considered = 1
print(len(pd.unique(data['id']))) # number of unique comments = 11782; the data is at the comment level
print(len(pd.unique(data['parent_id']))) # number of parent nodes = 6725
print(len(pd.unique(data['link_id']))) # number of submissions = 2377
print(len(pd.unique(data['author']))) # number of authors = 6063
print(len(data.columns)) # number of columns = 28

1
11782
6725
2377
6063
28


In [5]:
data.head(3)

Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,retreived_on,permalink,num_comments,...,cultural_similarity,parent_comment_author,insult_prob,toxicity_prob,threat_prob,date_time,date,time,date_hour,date_hour_min
5679,iqker6l,askscience,No it does not imply that. “We don’t yet know”...,omniskeptic,2,0,1664582942,1664960533,0,0,...,0.318494,chop1n,0.000171,0.000725,0.000118,2022-10-01_00:09:02,2022-10-01,00:09:02,2022-10-01_00,2022-10-01_00:09
5681,iqkfl8j,askscience,Pasteurization works by heating (generally a l...,jeweledjuniper,11,0,1664583360,1664960508,0,0,...,0.642043,feitingen,0.000198,0.000734,0.000142,2022-10-01_00:16:00,2022-10-01,00:16:00,2022-10-01_00,2022-10-01_00:16
5682,iqkfmj9,askscience,"It *absolutely* implies an expectation, even i...",chop1n,3,0,1664583378,1664960507,0,0,...,0.421561,omniskeptic,0.000175,0.000864,0.000121,2022-10-01_00:16:18,2022-10-01,00:16:18,2022-10-01_00,2022-10-01_00:16


Now read the file 'data_askscience_comment_level_culsim.csv', generated by 'Word_Embeddings_For_askscience_Subreddits.ipynb', which contains the cultural similarity for each comment without averaging across user pairs. This keeps the cultural similarity at the level of individual comments.

In [6]:
culsim_commentLevel = pd.read_csv("/content/gdrive/MyDrive/Colab Notebooks/askscience/data_askscience_comment_level_culsim.csv", index_col=0)

In [7]:
print(len(pd.unique(culsim_commentLevel['id']))) # number of unique comments = 11782; the data is at the comment level
print(len(pd.unique(culsim_commentLevel['from_user']))) # number of comment authors = 6063
print(len(pd.unique(culsim_commentLevel['to_user']))) # number of parent comment authors = 5068
print(len(culsim_commentLevel.columns)) # number of columns = 4

11782
6063
5068
4


In [8]:
culsim_commentLevel.head(3)

Unnamed: 0,id,from_user,to_user,cultural_similarity
0,iqker6l,omniskeptic,chop1n,0.432118
1,iqkfl8j,jeweledjuniper,feitingen,0.642043
2,iqkfmj9,chop1n,omniskeptic,0.421561


Now we need to map each of the 11782 comments in 'data_result_askscience_subreddits_aggressive_language.csv' to its comment-level cultural similarity.

In [9]:
def map_commentLevel_culSim(input_data, culturalSimilarity_commentlevel_data):

  input_data['culturalSimilarity_Com'] = np.nan
  j = 0

  for ind, row in input_data.iterrows():

    j += 1
    if j % 1000 == 0:
      print('finished comment '+str(j)+'/'+str(len(input_data)))

    cur_author = row['author']
    cur_parent = row['parent_comment_author']
    cur_comment_id = row['id']


    # Look up the row that matches this comment's id, author, and parent author.
    # Note: .values[0] raises an IndexError if no matching row exists.
    culsim_comLevel = culturalSimilarity_commentlevel_data[(culturalSimilarity_commentlevel_data['from_user'] == cur_author) & (culturalSimilarity_commentlevel_data['to_user'] == cur_parent) & (culturalSimilarity_commentlevel_data['id'] == cur_comment_id)]['cultural_similarity'].values[0]
    input_data.at[ind,'culturalSimilarity_Com'] = culsim_comLevel

  return input_data
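
The row-by-row lookup above scans the whole lookup table once per comment. As a possible alternative, the same mapping could be expressed as a single left merge on the three keys. This is only a sketch, not the method used in this notebook: the function name `map_comment_level_culsim_merge` and the toy frames are hypothetical, and the toy values are copied from the rows shown above.

```python
import pandas as pd

def map_comment_level_culsim_merge(input_data, culsim_data):
    """Attach the comment-level cultural similarity with one left merge (sketch)."""
    # Rename the lookup columns so both tables share the same key names.
    lookup = culsim_data.rename(columns={
        'from_user': 'author',
        'to_user': 'parent_comment_author',
        'cultural_similarity': 'culturalSimilarity_Com',
    })[['id', 'author', 'parent_comment_author', 'culturalSimilarity_Com']]

    # validate='many_to_one' raises if a key combination is duplicated in lookup.
    return input_data.merge(
        lookup,
        on=['id', 'author', 'parent_comment_author'],
        how='left',
        validate='many_to_one',
    )

# Toy rows copied from the tables shown above.
toy_data = pd.DataFrame({
    'id': ['iqker6l', 'iqkfl8j'],
    'author': ['omniskeptic', 'jeweledjuniper'],
    'parent_comment_author': ['chop1n', 'feitingen'],
    'cultural_similarity': [0.318494, 0.642043],
})
toy_culsim = pd.DataFrame({
    'id': ['iqker6l', 'iqkfl8j'],
    'from_user': ['omniskeptic', 'jeweledjuniper'],
    'to_user': ['chop1n', 'feitingen'],
    'cultural_similarity': [0.432118, 0.642043],
})
toy_merged = map_comment_level_culsim_merge(toy_data, toy_culsim)
print(toy_merged[['id', 'cultural_similarity', 'culturalSimilarity_Com']])
```

Two differences from the loop are worth keeping in mind: a comment without a match ends up with NaN instead of raising an error, and `merge` returns a new frame with a fresh integer index rather than modifying `input_data` in place.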

In [10]:
data1 = map_commentLevel_culSim(data,culsim_commentLevel)
print(len(data1)) # number of rows = 11782
print(len(pd.unique(data1['subreddit']))) # number of subreddits considered = 1
print(len(pd.unique(data1['id']))) # number of unique comments = 11782; the data is at the comment level
print(len(pd.unique(data1['author']))) # number of authors = 6063
print(len(pd.unique(data1['parent_id']))) # number of parent nodes = 6725
print(len(pd.unique(data1['link_id']))) # number of submissions = 2377
print(len(data1[['author', 'parent_comment_author']].value_counts())) # number of unique speaker-receiver pairs = 11100
print(len(data1.columns)) # number of columns = 29

finished comment 1000/11782
finished comment 2000/11782
finished comment 3000/11782
finished comment 4000/11782
finished comment 5000/11782
finished comment 6000/11782
finished comment 7000/11782
finished comment 8000/11782
finished comment 9000/11782
finished comment 10000/11782
finished comment 11000/11782
11782
1
11782
6063
6725
2377
11100
29


In [11]:
data1.head(10)

Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,retreived_on,permalink,num_comments,...,parent_comment_author,insult_prob,toxicity_prob,threat_prob,date_time,date,time,date_hour,date_hour_min,culturalSimilarity_Com
5679,iqker6l,askscience,No it does not imply that. “We don’t yet know”...,omniskeptic,2,0,1664582942,1664960533,0,0,...,chop1n,0.000171,0.000725,0.000118,2022-10-01_00:09:02,2022-10-01,00:09:02,2022-10-01_00,2022-10-01_00:09,0.432118
5681,iqkfl8j,askscience,Pasteurization works by heating (generally a l...,jeweledjuniper,11,0,1664583360,1664960508,0,0,...,feitingen,0.000198,0.000734,0.000142,2022-10-01_00:16:00,2022-10-01,00:16:00,2022-10-01_00,2022-10-01_00:16,0.642043
5682,iqkfmj9,askscience,"It *absolutely* implies an expectation, even i...",chop1n,3,0,1664583378,1664960507,0,0,...,omniskeptic,0.000175,0.000864,0.000121,2022-10-01_00:16:18,2022-10-01,00:16:18,2022-10-01_00,2022-10-01_00:16,0.421561
5692,iqkpm44,askscience,"Thank you for your submission! Unfortunately, ...",askscience-modteam,1,0,1664588425,1664960199,0,0,...,bhjoduar,0.000167,0.000615,0.000124,2022-10-01_01:40:25,2022-10-01,01:40:25,2022-10-01_01,2022-10-01_01:40,0.021882
5701,iqkrd5j,askscience,Thats also what I remember. There was speculat...,greese007,2,0,1664589335,1664960145,0,0,...,blscratch,0.000195,0.000915,0.000128,2022-10-01_01:55:35,2022-10-01,01:55:35,2022-10-01_01,2022-10-01_01:55,0.20336
5702,iqkre4b,askscience,"Not sure if you’re writing only about insects,...",viciousfishous08,34,0,1664589349,1664960143,0,0,...,thelogicalghost,0.000226,0.00161,0.000116,2022-10-01_01:55:49,2022-10-01,01:55:49,2022-10-01_01,2022-10-01_01:55,0.032065
5706,iqkww4r,askscience,Also your hearing recognizes the tones as cert...,tin_man6328,4,0,1664592247,1664959974,0,0,...,moewind420,0.000179,0.000659,0.000124,2022-10-01_02:44:07,2022-10-01,02:44:07,2022-10-01_02,2022-10-01_02:44,0.430613
5709,iqkyg8o,askscience,This is why you feel blinded when youre drivin...,yeswehavenotomatoes,34,0,1664593112,1664959927,0,0,...,balazer,0.001518,0.027269,0.0004,2022-10-01_02:58:32,2022-10-01,02:58:32,2022-10-01_02,2022-10-01_02:58,0.239771
5713,iqkz2yw,askscience,this sounds very interesting! would love to se...,mib_sum1ls,13,0,1664593459,1664959907,0,0,...,thelogicalghost,0.000253,0.001617,0.000111,2022-10-01_03:04:19,2022-10-01,03:04:19,2022-10-01_03,2022-10-01_03:04,0.310097
5715,iql02a4,askscience,Give me multiverse version of Starship Trooper...,glomgore,4,0,1664594017,1664959877,0,0,...,mib_sum1ls,0.000183,0.000962,0.000116,2022-10-01_03:13:37,2022-10-01,03:13:37,2022-10-01_03,2022-10-01_03:13,0.320632


confirming that the date time conversions were made properly

In [12]:
print(len(pd.unique(data['date_time'])))
print(pd.unique(data['date_time']))
print(len(pd.unique(data['date'])))
print(pd.unique(data['date']))
print(len(pd.unique(data['time'])))
print(pd.unique(data['time']))
print(len(pd.unique(data['date_hour'])))
print(pd.unique(data['date_hour']))
print(len(pd.unique(data['date_hour_min'])))
print(pd.unique(data['date_hour_min']))

11747
['2022-10-01_00:09:02' '2022-10-01_00:16:00' '2022-10-01_00:16:18' ...
 '2022-10-31_19:04:16' '2022-10-31_23:42:20' '2022-10-31_23:42:59']
31
['2022-10-01' '2022-10-30' '2022-10-29' '2022-10-02' '2022-10-03'
 '2022-10-04' '2022-10-05' '2022-10-06' '2022-10-31' '2022-10-07'
 '2022-10-08' '2022-10-12' '2022-10-09' '2022-10-10' '2022-10-11'
 '2022-10-13' '2022-10-14' '2022-10-15' '2022-10-16' '2022-10-17'
 '2022-10-18' '2022-10-19' '2022-10-20' '2022-10-21' '2022-10-22'
 '2022-10-23' '2022-10-24' '2022-10-25' '2022-10-26' '2022-10-27'
 '2022-10-28']
10958
['00:09:02' '00:16:00' '00:16:18' ... '19:04:16' '23:42:20' '23:42:59']
742
['2022-10-01_00' '2022-10-01_01' '2022-10-01_02' '2022-10-01_03'
 '2022-10-01_04' '2022-10-01_05' '2022-10-01_07' '2022-10-01_06'
 '2022-10-01_08' '2022-10-01_09' '2022-10-01_10' '2022-10-01_11'
 '2022-10-01_12' '2022-10-30_15' '2022-10-01_13' '2022-10-01_14'
 '2022-10-01_15' '2022-10-29_18' '2022-10-01_16' '2022-10-01_17'
 '2022-10-01_18' '2022-10-01_19'
[output truncated]
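
Printing the unique values only allows a visual check. A stricter check, sketched here, would rebuild each derived column from `created_utc` and compare it with the stored value. The helper `check_time_columns` is hypothetical, and the sketch assumes the conversion was done in UTC with the string formats visible in the table above (the three rows shown are consistent with that assumption).

```python
import pandas as pd

def check_time_columns(df):
    """Return True if every derived time column matches created_utc (assumed UTC)."""
    ts = pd.to_datetime(df['created_utc'], unit='s')
    expected = {
        'date_time': ts.dt.strftime('%Y-%m-%d_%H:%M:%S'),
        'date': ts.dt.strftime('%Y-%m-%d'),
        'time': ts.dt.strftime('%H:%M:%S'),
        'date_hour': ts.dt.strftime('%Y-%m-%d_%H'),
        'date_hour_min': ts.dt.strftime('%Y-%m-%d_%H:%M'),
    }
    # Every stored column must equal its rebuilt counterpart in every row.
    return all((df[col].astype(str) == exp).all() for col, exp in expected.items())

# Toy rows copied from data.head(3) above.
toy_times = pd.DataFrame({
    'created_utc': [1664582942, 1664583360],
    'date_time': ['2022-10-01_00:09:02', '2022-10-01_00:16:00'],
    'date': ['2022-10-01', '2022-10-01'],
    'time': ['00:09:02', '00:16:00'],
    'date_hour': ['2022-10-01_00', '2022-10-01_00'],
    'date_hour_min': ['2022-10-01_00:09', '2022-10-01_00:16'],
})
print(check_time_columns(toy_times))
```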

check that all comments have a 'culturalSimilarity_Com'

In [13]:
print(data1['culturalSimilarity_Com'].isna().sum())

0


---


## **Part 2:   add two dummy columns, 'networkSimilarity_DummyMed' and 'networkSimilarity_Dummy25', which take the value 1 for high network similarity comments and 0 for low network similarity comments**

---

In [14]:
data1.describe()

Unnamed: 0,score,gilded,created_utc,retreived_on,permalink,num_comments,url,self_text,is_self,controversiality,network_similarity,cultural_similarity,insult_prob,toxicity_prob,threat_prob,culturalSimilarity_Com
count,11782.0,11782.0,11782.0,11782.0,11782.0,11782.0,11782.0,11782.0,11782.0,11782.0,11782.0,11782.0,11782.0,11782.0,11782.0,11782.0
mean,27.581141,0.002037,1665898000.0,1667917000.0,0.0,0.0,0.0,0.0,0.0,0.018333,0.79794,0.429715,0.002401,0.015044,0.000688,0.429715
std,191.518286,0.050421,752061.9,245588.2,0.0,0.0,0.0,0.0,0.0,0.134158,0.282648,0.219164,0.02637,0.081679,0.012769,0.222935
min,-183.0,0.0,1664583000.0,1664958000.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.254859,-0.139406,0.000164,0.000498,8.3e-05,-0.139406
25%,1.0,0.0,1665249000.0,1667898000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.755739,0.254337,0.000176,0.000615,0.000119,0.24779
50%,2.0,0.0,1665882000.0,1667938000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.931205,0.464186,0.000182,0.000744,0.000127,0.465241
75%,10.0,0.0,1666582000.0,1667976000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.977107,0.601177,0.00021,0.001392,0.000137,0.604611
max,7928.0,3.0,1667260000.0,1668012000.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.968989,0.843692,0.996931,0.619349,0.968989


The median of network_similarity is 0.931205, and the 25th percentile is 0.755739. These two values are used as the cutoffs for the two dummy columns.

In [15]:
def map_comment_networkDummy(input_data):

  input_data['networkSimilarity_DummyMed'] = np.nan
  input_data['networkSimilarity_Dummy25'] = np.nan
  j = 0

  for ind, row in input_data.iterrows():

    j += 1
    if j % 1000 == 0:
      print('finished comment '+str(j)+'/'+str(len(input_data)))

    cur_network_similarity = row['network_similarity']

    if cur_network_similarity < 0.931205:
      val = 0
    else:
      val = 1
    input_data.at[ind,'networkSimilarity_DummyMed'] = val

    if cur_network_similarity < 0.755739:
      val1 = 0
    else:
      val1 = 1 # values greater than or equal to the 25th percentile get 1
    input_data.at[ind,'networkSimilarity_Dummy25'] = val1

  return input_data
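
The function above hard-codes the cutoffs as rounded values read from `describe()`. A possible variant, sketched below, computes the cutoffs from the data and builds both dummies with vectorized comparisons. The name `add_network_dummies` and the toy values are hypothetical, and because the cutoffs here are exact quantiles rather than rounded constants, comments lying very close to a cutoff could be classified differently.

```python
import pandas as pd

def add_network_dummies(df, col='network_similarity'):
    """Add median and 25th percentile dummies without a Python loop (sketch)."""
    out = df.copy()  # leave the caller's frame untouched
    median_cut = out[col].quantile(0.50)
    q25_cut = out[col].quantile(0.25)
    # 1.0 when the value is at or above the cutoff, 0.0 otherwise.
    out['networkSimilarity_DummyMed'] = (out[col] >= median_cut).astype(float)
    out['networkSimilarity_Dummy25'] = (out[col] >= q25_cut).astype(float)
    return out

# Hypothetical toy values, chosen only to illustrate the two cutoffs.
toy = pd.DataFrame({'network_similarity': [0.10, 0.50, 0.90, 1.00]})
toy_out = add_network_dummies(toy)
print(toy_out)
```

Unlike `map_comment_networkDummy`, this sketch returns a copy, so the input frame keeps its original columns.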

In [16]:
data2 = map_comment_networkDummy(data1)
print(len(data2)) # number of rows = 11782
print(len(pd.unique(data2['subreddit']))) # number of subreddits considered = 1
print(len(pd.unique(data2['id']))) # number of unique comments = 11782; the data is at the comment level
print(len(pd.unique(data2['author']))) # number of authors = 6063
print(len(pd.unique(data2['parent_id']))) # number of parent nodes = 6725
print(len(pd.unique(data2['link_id']))) # number of submissions = 2377
print(len(data2[['author', 'parent_comment_author']].value_counts())) # number of unique speaker-receiver pairs = 11100
print(len(data2.columns)) # number of columns = 31

finished comment 1000/11782
finished comment 2000/11782
finished comment 3000/11782
finished comment 4000/11782
finished comment 5000/11782
finished comment 6000/11782
finished comment 7000/11782
finished comment 8000/11782
finished comment 9000/11782
finished comment 10000/11782
finished comment 11000/11782
11782
1
11782
6063
6725
2377
11100
31


In [17]:
data2.head(10)

Unnamed: 0,id,subreddit,body,author,score,gilded,created_utc,retreived_on,permalink,num_comments,...,toxicity_prob,threat_prob,date_time,date,time,date_hour,date_hour_min,culturalSimilarity_Com,networkSimilarity_DummyMed,networkSimilarity_Dummy25
5679,iqker6l,askscience,No it does not imply that. “We don’t yet know”...,omniskeptic,2,0,1664582942,1664960533,0,0,...,0.000725,0.000118,2022-10-01_00:09:02,2022-10-01,00:09:02,2022-10-01_00,2022-10-01_00:09,0.432118,1.0,1.0
5681,iqkfl8j,askscience,Pasteurization works by heating (generally a l...,jeweledjuniper,11,0,1664583360,1664960508,0,0,...,0.000734,0.000142,2022-10-01_00:16:00,2022-10-01,00:16:00,2022-10-01_00,2022-10-01_00:16,0.642043,1.0,1.0
5682,iqkfmj9,askscience,"It *absolutely* implies an expectation, even i...",chop1n,3,0,1664583378,1664960507,0,0,...,0.000864,0.000121,2022-10-01_00:16:18,2022-10-01,00:16:18,2022-10-01_00,2022-10-01_00:16,0.421561,1.0,1.0
5692,iqkpm44,askscience,"Thank you for your submission! Unfortunately, ...",askscience-modteam,1,0,1664588425,1664960199,0,0,...,0.000615,0.000124,2022-10-01_01:40:25,2022-10-01,01:40:25,2022-10-01_01,2022-10-01_01:40,0.021882,0.0,0.0
5701,iqkrd5j,askscience,Thats also what I remember. There was speculat...,greese007,2,0,1664589335,1664960145,0,0,...,0.000915,0.000128,2022-10-01_01:55:35,2022-10-01,01:55:35,2022-10-01_01,2022-10-01_01:55,0.20336,0.0,1.0
5702,iqkre4b,askscience,"Not sure if you’re writing only about insects,...",viciousfishous08,34,0,1664589349,1664960143,0,0,...,0.00161,0.000116,2022-10-01_01:55:49,2022-10-01,01:55:49,2022-10-01_01,2022-10-01_01:55,0.032065,1.0,1.0
5706,iqkww4r,askscience,Also your hearing recognizes the tones as cert...,tin_man6328,4,0,1664592247,1664959974,0,0,...,0.000659,0.000124,2022-10-01_02:44:07,2022-10-01,02:44:07,2022-10-01_02,2022-10-01_02:44,0.430613,1.0,1.0
5709,iqkyg8o,askscience,This is why you feel blinded when youre drivin...,yeswehavenotomatoes,34,0,1664593112,1664959927,0,0,...,0.027269,0.0004,2022-10-01_02:58:32,2022-10-01,02:58:32,2022-10-01_02,2022-10-01_02:58,0.239771,1.0,1.0
5713,iqkz2yw,askscience,this sounds very interesting! would love to se...,mib_sum1ls,13,0,1664593459,1664959907,0,0,...,0.001617,0.000111,2022-10-01_03:04:19,2022-10-01,03:04:19,2022-10-01_03,2022-10-01_03:04,0.310097,0.0,1.0
5715,iql02a4,askscience,Give me multiverse version of Starship Trooper...,glomgore,4,0,1664594017,1664959877,0,0,...,0.000962,0.000116,2022-10-01_03:13:37,2022-10-01,03:13:37,2022-10-01_03,2022-10-01_03:13,0.320632,0.0,1.0


check that there are no missing values in the new columns

In [19]:
print(data2['networkSimilarity_DummyMed'].isna().sum())

0


In [20]:
print(data2['networkSimilarity_Dummy25'].isna().sum())

0


check how many rows have value 1

In [22]:
print(data2['networkSimilarity_DummyMed'].sum())

5891.0


In [23]:
print(data2['networkSimilarity_Dummy25'].sum())

8836.0


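A comment gets the value 1 in 'networkSimilarity_Dummy25' if its network similarity is greater than or equal to the 25th percentile value. Thus about 75 percent of the comments (8836 of 11782) have the value 1, while exactly half (5891 of 11782) have the value 1 in 'networkSimilarity_DummyMed'.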
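
The shares can be checked directly from the counts printed above. This is a small arithmetic sketch; the variable names are illustrative only.

```python
# Counts taken from the outputs above.
n_comments = 11782
ones_median = 5891   # sum of networkSimilarity_DummyMed
ones_q25 = 8836      # sum of networkSimilarity_Dummy25

share_median = ones_median / n_comments
share_q25 = ones_q25 / n_comments
print(f"median dummy share: {share_median:.4f}")
print(f"25th percentile dummy share: {share_q25:.4f}")
```

On the full data, the same shares could also be obtained with `data2['networkSimilarity_DummyMed'].mean()` and `data2['networkSimilarity_Dummy25'].mean()`, since the mean of a 0/1 column equals the proportion of ones.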

In [24]:
data2.to_csv('/content/gdrive/MyDrive/Colab Notebooks/askscience/data_askscience_regressions.csv')