# When and how to nudge conversations back on track?

This notebook brings in two datasets of moderated conversations and explores them to determine what kinds of nudges defuse a conversation that was likely to go awry. By "awry" we mean conversations that devolve into toxicity and may end in moderation.

## Imports

In [1]:
import pandas as pd
import numpy as np
from convokit import Corpus, download




## Datasets

The first dataset, om-data.csv, was collected through the Pushshift API in 2018. It appears to contain about 10k comments from each of 107 different subreddits (roughly 1.07M comments in total, per the counts in the Pushshift section below), half of which were removed. Comments made by authors who have since deleted their accounts have been filtered out.

In [2]:
moderated_df = pd.read_csv('om-data.csv', low_memory=False)
moderated_df.head()

Unnamed: 0.1,Unnamed: 0,id,author,author_flair_css_class,author_flair_text,body,body_html,created,created_utc,distinguished,...,parent_id,score,score_hidden,stickied,subreddit,subreddit_id,ups,removed,edited,controversiality
0,207099101,dbjdtyr,Disheartend,20k,hi,I have 69 step authentication :(,,,1482478000.0,,...,t1_dbi64q2,1.0,,False,2007scape,t5_2wbww,,False,False,0.0
1,105244658,d73hifd,abdulcool1,ranging,,"Yeah, I started. Its about 60-70 points per hr.",,,1472610000.0,,...,t1_d730l9d,1.0,,False,2007scape,t5_2wbww,1.0,False,False,0.0
2,123778740,d7woshc,StayyFrostyy,,,but the helm slot is always freed? you dont ne...,,,1474497000.0,,...,t1_d7w7igx,1.0,,False,2007scape,t5_2wbww,1.0,False,False,0.0
3,7223695,d2un5cu,Loganimal,quests,,Ook ook (I agree),,,1462501000.0,,...,t3_4i35r9,10.0,,False,2007scape,t5_2wbww,10.0,False,False,0.0
4,118173056,d7nz8ld,EJK_,firemaking,,Meanwhile I'm running down from NorthWest to h...,,,1473957000.0,,...,t3_52w8wl,2.0,,False,2007scape,t5_2wbww,2.0,False,False,0.0


In [3]:
# Note: despite the `_df` suffix, these are ConvoKit Corpus objects, not DataFrames.
wikipedia_df = Corpus(filename=download("conversations-gone-awry-corpus"))
reddit_df = Corpus(filename=download("conversations-gone-awry-cmv-corpus-large"))

Dataset already exists at /Users/himnish.hunma/.convokit/saved-corpora/conversations-gone-awry-corpus
Dataset already exists at /Users/himnish.hunma/.convokit/saved-corpora/conversations-gone-awry-cmv-corpus-large


## Data cleaning and organization

We want to analyze threads that have a high propensity to go awry but don't get moderated, as well as threads whose propensity to go awry suddenly drops. To do this, we need to organize the data into its conversational structure.
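The two selection criteria above can be sketched on a single thread as follows. This is only an illustration: the per-utterance derailment scores, the `find_sudden_drops` helper, and the `high` and `drop` thresholds are all hypothetical and not part of this notebook or of ConvoKit.

```python
def find_sudden_drops(scores, high=0.7, drop=0.3):
    """Return indices where the score falls by at least `drop`
    immediately after being at or above `high` (both thresholds are assumptions)."""
    return [
        i
        for i in range(1, len(scores))
        if scores[i - 1] >= high and scores[i - 1] - scores[i] >= drop
    ]


# Invented derailment scores for one thread, one score per utterance.
thread_scores = [0.2, 0.5, 0.8, 0.4, 0.3]
thread_was_moderated = False

# Criterion 1: the thread looked risky at some point but was never moderated.
high_risk_unmoderated = max(thread_scores) >= 0.7 and not thread_was_moderated

# Criterion 2: the risk dropped sharply from one utterance to the next.
drops = find_sudden_drops(thread_scores)

print(high_risk_unmoderated, drops)
```

The utterance at each returned index is a candidate "nudge", since it is the comment after which the estimated risk fell.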

A ConvoKit corpus stores the utterances themselves and, separately, the order in which they appeared in each thread. See the [ConvoKit documentation](https://convokit.cornell.edu/documentation/tutorial.html#:~:text=We%20get%20a%20%E2%80%98linear%E2%80%99%20conversation%20that%20does%20not%20branch%20out%20into%20subtrees.) for details.
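To make the idea of conversational structure concrete, here is a minimal pure-Python sketch that rebuilds a reply tree from flat utterance records and lists its root-to-leaf paths. The field names `id` and `reply_to` mirror ConvoKit's utterance attributes, but the records and the `root_to_leaf_paths` helper are invented for illustration and do not call ConvoKit.

```python
from collections import defaultdict

# Toy utterances: "a" is the root, "b" and "c" reply to it, "d" replies to "b".
utterances = [
    {"id": "a", "reply_to": None},
    {"id": "b", "reply_to": "a"},
    {"id": "c", "reply_to": "a"},
    {"id": "d", "reply_to": "b"},
]


def root_to_leaf_paths(utts):
    """Return every root-to-leaf path of utterance ids in the reply tree."""
    children = defaultdict(list)
    roots = []
    for utt in utts:
        if utt["reply_to"] is None:
            roots.append(utt["id"])
        else:
            children[utt["reply_to"]].append(utt["id"])

    paths = []
    stack = [[root] for root in roots]
    while stack:
        path = stack.pop()
        replies = children.get(path[-1], [])
        if not replies:
            paths.append(path)  # no replies: this path ends at a leaf
        for reply in replies:
            stack.append(path + [reply])
    return sorted(paths)


paths = root_to_leaf_paths(utterances)
print(paths)
```

Each path is one linear reading of the thread, which is the unit a forecaster would typically score utterance by utterance.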

## Preliminary analysis

This section computes summary statistics for each dataset to understand how the data is laid out.

### Reddit CMV data

In [4]:
reddit_df.print_summary_stats()

Number of Speakers: 24555
Number of Utterances: 116793
Number of Conversations: 19578


In [5]:
reddit_convo_df = reddit_df.get_conversations_dataframe()
# Count conversations with and without a removed comment.
reddit_convo_df.groupby('meta.has_removed_comment').agg(number_removed_comment=('meta.has_removed_comment', 'count'))

Unnamed: 0_level_0,number_removed_comment
meta.has_removed_comment,Unnamed: 1_level_1
False,9789
True,9789


In [6]:
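reddit_utterances_df = reddit_df.get_utterances_dataframe()
# Largest number of utterances in any single conversation.
# Note: this counts utterances per conversation, not the length of the longest reply chain.
reddit_utterances_df.groupby('conversation_id').agg(conversational_depth=('speaker', 'count')).max()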

conversational_depth    41
dtype: int64
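The `conversational_depth` figure above is the number of utterances in a conversation. If the length of the longest reply chain is what matters, it can be derived from the `reply_to` column. The sketch below uses a toy DataFrame shaped like the output of `get_utterances_dataframe()` (utterance id as index, with `conversation_id` and `reply_to` columns); the data and the `reply_depth` helper are hypothetical.

```python
import pandas as pd

# Toy stand-in for an utterances DataFrame; values are invented.
toy_utts = pd.DataFrame(
    {
        "conversation_id": ["c1", "c1", "c1", "c1", "c2"],
        "reply_to": [None, "u1", "u1", "u2", None],
    },
    index=["u1", "u2", "u3", "u4", "u5"],
)


def reply_depth(utt_id, parents):
    """Count utterances on the chain from `utt_id` up to its root."""
    depth = 1
    # pd.notna treats None, NaN and pd.NA alike, so any missing marker stops the walk.
    while pd.notna(parents.get(utt_id)):
        utt_id = parents[utt_id]
        depth += 1
    return depth


parents = toy_utts["reply_to"].to_dict()
toy_utts["reply_depth"] = [reply_depth(u, parents) for u in toy_utts.index]

summary = toy_utts.groupby("conversation_id").agg(
    n_utterances=("reply_depth", "size"),
    max_reply_depth=("reply_depth", "max"),
)
print(summary)
```

In conversation `c1` the two measures differ: it has four utterances, but its longest chain (`u1` → `u2` → `u4`) is three utterances long.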

### Wikipedia data

In [7]:
wikipedia_df.print_summary_stats()


Number of Speakers: 8069
Number of Utterances: 30021
Number of Conversations: 4188


In [8]:
wikipedia_convo_df = wikipedia_df.get_conversations_dataframe()
wikipedia_convo_df.groupby('meta.conversation_has_personal_attack').agg(number_personal_attack=('meta.conversation_has_personal_attack', 'count'))

Unnamed: 0_level_0,number_personal_attack
meta.conversation_has_personal_attack,Unnamed: 1_level_1
False,2094
True,2094


In [9]:
wikipedia_utterances_df = wikipedia_df.get_utterances_dataframe()
# Largest number of utterances in any single conversation (a count, not reply-chain depth).
wikipedia_utterances_df.groupby('conversation_id').agg(conversational_depth=('speaker', 'count')).max()

conversational_depth    20
dtype: int64

### Pushshift data

In [10]:
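# Note: this cell still queries the CMV corpus (`reddit_df`), not the Pushshift data;
# it returns the total number of CMV conversations.
reddit_conversation_df = reddit_df.get_conversations_dataframe()
reddit_conversation_df["meta.has_removed_comment"].count()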

np.int64(19578)

In [11]:
n_removed = moderated_df.query('removed == True')['removed'].count()
n_unremoved = moderated_df.query('removed == False')['removed'].count()
print(f"Pushshift DF removed comments: {n_removed}, unremoved comments: {n_unremoved}")

Pushshift DF removed comments: 535000, unremoved comments: 534980


In [12]:
print(f"Pushshift DF number of unique posts: {len(moderated_df['link_id'].unique())}")

Pushshift DF number of unique posts: 483016


In [13]:
# Number of comments per post (`link_id`), computed once.
# Note: this counts comments in the sample, not the length of the longest reply chain.
comments_per_post = moderated_df.groupby('link_id')['parent_id'].count()
print(f"Mean conversational depth: {comments_per_post.mean():.5f}, Max conversational depth: {comments_per_post.max()}")

Mean conversational depth: 2.21525, Max conversational depth: 761


## Pushshift data organization

This section attempts to run a conversational derailment forecaster on the Pushshift dataset and identify false positives. Some research questions motivating the exploration below:

- Where do attacks usually happen in a conversation?
- How does the Pushshift structure differ from the ConvoKit structure, and how does that affect whether ConvoKit models can be run against it?
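On the second question, one structural difference is visible in the `parent_id` column: Reddit prefixes ids with `t1_` for comments and `t3_` for submissions, whereas ConvoKit utterances carry plain `reply_to` and `conversation_id` fields. The sketch below maps Pushshift-style rows to ConvoKit-style fields. The `to_convokit_like` helper, the `is_top_level` flag, and the sample rows are assumptions made for illustration, and the sketch assumes `link_id` uses the same `t3_` prefix.

```python
def to_convokit_like(row):
    """Map one Pushshift comment row to ConvoKit-style id fields."""
    kind, _, parent = row["parent_id"].partition("_")
    return {
        "id": row["id"],
        "reply_to": parent,
        "conversation_id": row["link_id"].partition("_")[2],
        # A `t3_` parent is the submission itself, so the comment is top-level.
        "is_top_level": kind == "t3",
    }


# Invented rows: c3 replies to a comment that is not in the sample.
rows = [
    {"id": "c1", "parent_id": "t3_p1", "link_id": "t3_p1"},
    {"id": "c2", "parent_id": "t1_c1", "link_id": "t3_p1"},
    {"id": "c3", "parent_id": "t1_zz", "link_id": "t3_p1"},
]
converted = [to_convokit_like(r) for r in rows]

# Replies whose parent comment is absent cannot be placed in a reply chain.
known_ids = {r["id"] for r in rows}
orphans = [
    c["id"]
    for c in converted
    if not c["is_top_level"] and c["reply_to"] not in known_ids
]
print(converted, orphans)
```

Because the Pushshift file is a sample of comments rather than full threads, orphaned replies like `c3` may be common, which would limit how directly conversation-level models can be applied.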

In [14]:
moderated_df.head()

Unnamed: 0.1,Unnamed: 0,id,author,author_flair_css_class,author_flair_text,body,body_html,created,created_utc,distinguished,...,parent_id,score,score_hidden,stickied,subreddit,subreddit_id,ups,removed,edited,controversiality
0,207099101,dbjdtyr,Disheartend,20k,hi,I have 69 step authentication :(,,,1482478000.0,,...,t1_dbi64q2,1.0,,False,2007scape,t5_2wbww,,False,False,0.0
1,105244658,d73hifd,abdulcool1,ranging,,"Yeah, I started. Its about 60-70 points per hr.",,,1472610000.0,,...,t1_d730l9d,1.0,,False,2007scape,t5_2wbww,1.0,False,False,0.0
2,123778740,d7woshc,StayyFrostyy,,,but the helm slot is always freed? you dont ne...,,,1474497000.0,,...,t1_d7w7igx,1.0,,False,2007scape,t5_2wbww,1.0,False,False,0.0
3,7223695,d2un5cu,Loganimal,quests,,Ook ook (I agree),,,1462501000.0,,...,t3_4i35r9,10.0,,False,2007scape,t5_2wbww,10.0,False,False,0.0
4,118173056,d7nz8ld,EJK_,firemaking,,Meanwhile I'm running down from NorthWest to h...,,,1473957000.0,,...,t3_52w8wl,2.0,,False,2007scape,t5_2wbww,2.0,False,False,0.0
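Using column names that appear in this dataset (`link_id`, `created_utc`, `removed`), the first research question could be approached by locating the first removed comment within each post. The sketch below runs on an invented toy DataFrame, and the `position` and `relative_position` columns are hypothetical names introduced here.

```python
import pandas as pd

# Toy data: post t3_a has four comments (the last two removed), t3_b has none removed.
toy = pd.DataFrame(
    {
        "link_id": ["t3_a", "t3_a", "t3_a", "t3_a", "t3_b", "t3_b"],
        "created_utc": [10, 20, 30, 40, 5, 15],
        "removed": [False, False, True, True, False, False],
    }
)

# Order comments chronologically within each post and number them from 0.
toy = toy.sort_values(["link_id", "created_utc"])
toy["position"] = toy.groupby("link_id").cumcount()

# Scale positions to [0, 1]; clip avoids dividing by zero for single-comment posts.
thread_size = toy.groupby("link_id")["position"].transform("size")
toy["relative_position"] = toy["position"] / (thread_size - 1).clip(lower=1)

# Relative position of the earliest removed comment in each post that has one.
first_removed = toy[toy["removed"]].groupby("link_id")["relative_position"].min()
print(first_removed)
```

On the real data these positions are relative to the sampled comments only, so with few comments per post they should be read as rough indicators rather than true positions in the full thread.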
