# Download the data (or run the script prepare-cmv-data.sh locally)

In [1]:
!gdown --fuzzy https://drive.google.com/file/d/1lJxnI5r1e13hkOprIgcSfnIEto7kol_K/view?usp=sharing

Downloading...
From (original): https://drive.google.com/uc?id=1lJxnI5r1e13hkOprIgcSfnIEto7kol_K
From (redirected): https://drive.google.com/uc?id=1lJxnI5r1e13hkOprIgcSfnIEto7kol_K&confirm=t&uuid=031be8da-107f-43e0-9d39-64b7c23cc98e
To: /content/cmv.direct_replies.csv
100% 438M/438M [00:12<00:00, 36.3MB/s]


# Load the Dataset


In [2]:
import pandas as pd
import ast

df = pd.read_csv("/content/cmv.direct_replies.csv", converters={'Replies': ast.literal_eval})
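
The `Replies` column is stored in the CSV as the string form of a Python list of `(author, text)` tuples, which is why `ast.literal_eval` is passed as a converter. A minimal sketch of what that conversion does on a single cell; the sample string is made up for illustration and is not taken from the dataset:

```python
import ast

# Hypothetical cell value, shaped like the Replies column (not real data).
raw_cell = "[('alice', 'I disagree.'), ('bob', \"That's a fair point.\")]"

# literal_eval only parses Python literals, so it does not execute code.
replies = ast.literal_eval(raw_cell)

for author, text in replies:
    print(f"{author}: {text}")
```

Each parsed element is a tuple, so the author is at index 0 and the reply text at index 1.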

In [3]:
df.head()


Unnamed: 0,Author,Topic,Post,Replies
0,recycled_kevlar,jobs_automation_will_robots,Automation is happening at an astonishing rate...,"[(EstoAm, You are also assuming there will eve..."
1,EstoAm,jobs_automation_will_robots,"You are also assuming there will ever be a ""po...","[(recycled_kevlar, Two things: A post scarcity..."
2,recycled_kevlar,jobs_automation_will_robots,Two things: A post scarcity economy is not def...,"[(EstoAm, No the automation you are talking ab..."
3,EstoAm,jobs_automation_will_robots,No the automation you are talking about has ma...,"[(recycled_kevlar, Why? Economic value can not..."
4,recycled_kevlar,jobs_automation_will_robots,Why? Economic value can not be forced onto a t...,"[(EstoAm, Because humans have an innate drive ..."


In [4]:
len(df)

284409

# 1. Show an example of an exchange

In [5]:
row = df.iloc[17]

print(f"Author {row['Author']} posted: {row['Post']}")
print("Replies:")
for reply in row["Replies"]:
    print(f"    Author: {reply[0]}, Post: {reply[1]}")

Author DurianMD posted: So recently I've seen a lot of posts condemning Islam as a violent religion or a sexist religion. I point out that many Christians follow the bible which has numerous examples of sexism, but in application, there are numerous branches of Christianity that are no more sexist than secular groups. For example, Congregationalists and Universaliists. So, my belief is that while religion can inform the views of people, it is far more likely that religion will be used to justify actions that would have been executed any way. I think that most Jewish people don't want to stone adulterers and most Muslims don't want to stone non believers. _____ &gt; *Hello, users of CMV! This is a footnote from your moderators. We'd just like to remind you of a couple of things. Firstly, please remember to* ***[read through our rules](http://www.reddit.com/r/changemyview/wiki/rules)***. *If you see a comment that has broken one, it is more effective to report it than downvote it. Speaki

# 2. Count all exchanges


In [None]:
# Each post contributes one exchange per direct reply.
count = int(df["Replies"].str.len().sum())

print("Number of exchanges:", count)

Number of exchanges: 409650


# List of Authors

In [None]:
df['Author'].value_counts()

Author
[deleted]       23874
BenIncognito     1360
Hq3473           1312
GnosticGnome     1308
monkyyy          1242
                ...  
selfish             1
Emeryn              1
MerryWalrus         1
joebarts97          1
Haleljacob          1
Name: count, Length: 18350, dtype: int64

# Author = [deleted] ?
Since 23874 posts have a deleted author, let's check which topics those posts belong to.

In [None]:
df[df["Author"]=="[deleted]"]

Unnamed: 0,Author,Topic,Post,Replies
86,[deleted],soda_coffee_caffeine_starbucks,[deleted],"[([deleted], [deleted])]"
87,[deleted],soda_coffee_caffeine_starbucks,[deleted],"[([deleted], [deleted])]"
88,[deleted],soda_coffee_caffeine_starbucks,[deleted],"[([deleted], [deleted])]"
89,[deleted],soda_coffee_caffeine_starbucks,[deleted],"[([deleted], [deleted])]"
90,[deleted],soda_coffee_caffeine_starbucks,[deleted],"[([deleted], [deleted])]"
...,...,...,...,...
284336,[deleted],the_and_to_of,[deleted],"[(2074red2074, I never said that we don't want..."
284386,[deleted],music_song_songs_listen,"&gt; And what's wrong with that Nothing, but t...","[(call_it_art, If there's nothing wrong with i..."
284388,[deleted],music_song_songs_listen,Because it looks ridiculous. There's a differe...,"[(call_it_art, I think you're only old if you ..."
284390,[deleted],music_song_songs_listen,&gt; I think you're only old if you act old. N...,"[(call_it_art, What are some examples of the r..."


## Top 10 topics among deleted-author posts

In [None]:
df[df["Author"]=="[deleted]"]["Topic"].value_counts()[:10]

Topic
the_and_to_of                       7010
abortion_fetus_prolife_abortions     472
feminism_men_feminist_women          468
meat_animals_eating_eat              413
gun_guns_firearms_weapons            410
rape_victim_victims_raped            386
gender_transgender_trans_sex         337
black_white_racism_racist            322
music_song_songs_listen              309
marriage_gay_polygamy_marry          287
Name: count, dtype: int64

# Drop deleted author

In [7]:
# Copy the filtered frame so later column assignments do not act on a slice.
df = df[df["Author"] != "[deleted]"].copy()
print(len(df))

260535


Note: the original dataset has 284409 rows. <br>
After dropping rows with a deleted author, 260535 rows remain. <br>
That removes 23874 posts by deleted authors.

# 3. Top 10 authors by number of posts

In [9]:
author_counts = df['Author'].value_counts()

# Get the top 10 authors
top_10_authors = author_counts.head(10)

# Display the top 10 authors
print(top_10_authors)

Author
BenIncognito       1360
Hq3473             1312
GnosticGnome       1308
monkyyy            1242
huadpe             1092
Raintee97           978
z3r0shade           965
DHCKris             921
Mavericgamer        900
AnxiousPolitics     843
Name: count, dtype: int64


# 4. Top 10 authors by number of replies received

In [None]:
print(list(df.iloc[[17]]['Replies']))

[[('recycled_kevlar', 'Your stance relies on the assumption that religion has no influence on the actions of its followers beyond the superficial. Yet something must exist that allows this pattern to occur. Ill narrow it down to religion or culture. So, you are correct if you assume the culture dominates the religion, and you are incorrect if the reverse is true. With this in mind, I think its safe to assume the truth is somewhere in between, with both the religion and the culture somehow influencing the unrest we see.'), ('beer_demon', "The fact I don't fear an Amish flying a plane into my office tower or a Jain blowing up my bus makes me doubt that religions have no inherent features that influence people's actions.")]]


In [None]:
l = [item for t in df.iloc[[17]]['Replies'] for item in t]
len(l)

2

In [11]:
# Number of direct replies each post received (0 if the cell is not a list)
df['Replies_Count'] = df['Replies'].apply(lambda replies: len(replies) if isinstance(replies, list) else 0)
# Total replies received across each author's posts
author_comment_counts = df.groupby('Author')['Replies_Count'].sum()
top_10_authors = author_comment_counts.nlargest(10)

print(top_10_authors)

Author
Hq3473          1824
BenIncognito    1474
GnosticGnome    1436
monkyyy         1313
huadpe          1248
kabukistar      1133
z3r0shade       1060
Raintee97       1038
Celda           1014
DHCKris          978
Name: Replies_Count, dtype: int64
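
Note that `Replies_Count` sums the replies that each author's posts received. If the goal is instead to rank authors by the number of replies they wrote, the reply tuples themselves have to be counted. A minimal sketch, assuming each entry in `Replies` is a list of `(author, text)` tuples; `count_reply_authors` and the toy data are hypothetical:

```python
from collections import Counter

def count_reply_authors(replies_column):
    """Count how many replies each author wrote.

    `replies_column` is any iterable of reply lists, e.g. df["Replies"].
    """
    counts = Counter()
    for replies in replies_column:
        for author, _text in replies:
            counts[author] += 1
    return counts

# Toy data for illustration only (not from the dataset).
toy_replies = [
    [("alice", "first reply"), ("bob", "second reply")],
    [("alice", "another reply")],
    [],
]
print(count_reply_authors(toy_replies).most_common(10))
```

On the notebook's data this could be called as `count_reply_authors(df["Replies"])`.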


# 5. Measuring Convergence
Following the [paper](https://aclanthology.org/W11-0609.pdf), we check whether person A’s use of articles in an utterance triggers the use of articles in respondent B’s reply.

**Note** that this differs from asking whether B uses articles more often when talking to A than when talking to other people (it is not so surprising that people speak differently to different audiences).

Research question: does each utterance by A trigger an immediate change in B’s behavior?
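
Written out, the quantity behind this question compares the chance that B's reply to A exhibits a feature class C when A's utterance exhibits it against the chance that B's reply to A exhibits it at all. The notation below is a simplified paraphrase of the paper's convergence measure, not a verbatim copy:

```latex
\mathrm{Conv}_{A,B}(C) =
  P\big(\text{B's reply to A exhibits } C \mid \text{A's utterance exhibits } C\big)
  - P\big(\text{B's reply to A exhibits } C\big)
```

A positive value means that A's use of C makes B's immediate use of C more likely than B's baseline when replying to A.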


In [None]:
# to measure the extent to which person B accommodates to A

In [None]:
import numpy as np
from collections import defaultdict
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [None]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

# Features

In [None]:
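# Section 4 of the paper ("Measuring linguistic style") uses nine LIWC-derived
# categories (Pennebaker et al., 2007): articles, auxiliary verbs, conjunctions,
# high-frequency adverbs, impersonal pronouns, negations, personal pronouns,
# prepositions, and quantifiers.

# Maps a category name to Penn Treebank POS tags.
# Caveat: 'RB' is the general adverb tag, so this key captures all adverbs
# (including "not" and "n't"), not negations alone.
liwc_categories = {
    'negations': ['RB'],
}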
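
Because the Penn Treebank tag `RB` covers adverbs in general, a tag-based mapping picks up words such as `really` and `already` alongside true negations. One alternative is to match against an explicit word list. A minimal sketch, assuming a small hand-written list; `NEGATION_WORDS` is an illustrative stand-in and not the LIWC lexicon, and `find_negations` is a hypothetical helper:

```python
import re

# Illustrative stand-in for a negation lexicon (assumption, not LIWC).
NEGATION_WORDS = ("no", "not", "never", "none", "nobody", "nothing", "neither", "nor")

# Whole-word matches for the list above, plus the contraction suffix "n't".
NEGATION_PATTERN = re.compile(r"\b(?:" + "|".join(NEGATION_WORDS) + r")\b|n't")

def find_negations(text):
    """Return the negation tokens found in `text`, lowercased, in order."""
    # Normalize curly apostrophes so "isn’t" is treated like "isn't".
    normalized = text.lower().replace("\u2019", "'")
    return NEGATION_PATTERN.findall(normalized)

print(find_negations("I don't think so, and it's never really been true."))
```

This trades coverage for precision: words missing from the list are not detected, but general adverbs are no longer counted as negations.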


In [None]:
def extract_liwc_words(text):
    """
      Extract words based on LIWC categories
    """
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    liwc_words = []
    for word, tag in pos_tags:
        for category, tags in liwc_categories.items():
            if tag in tags:
                liwc_words.append(word)
                break
    return liwc_words

corpus = df.Post.values
liwc_corpus = [extract_liwc_words(doc) for doc in corpus]


In [None]:
liwc_feature = [x for each_liwc_cat in liwc_corpus for x in each_liwc_cat if x !="//www.reddit.com/message/compose"]
liwc_feature_set = set(liwc_feature)
print(liwc_feature_set)

{'additionally', 'away', 'forward', 'strategically', 'not', 'eventually', 'just', 'exactly', 'as', 'obviously', 'Very', 'However', 'Secondly', 'actually', 'only', 'then', 'already', 'really', 'Once', 'ago', 'officially', 'there', 'even', 'much', 'likely', 'no', 'far', 'first', 'currently', 'merely', 'never', 'economically', 'enough', 'somewhere', 'illegally', 'ever', 'probably', 'routinely', 'fully', 'Perhaps', 'widely', 'else', 'longer', 'however', 'historically', 'So', 'again', 'always', 'Theoretically', 'still', 'equally', "n't", 'hopefully', 'almost', 'also', 'Firstly', 'Even', 'recently', 'completely', 'very', 'Now', 'now', 'quite', 'Then', 'explicitly', 'so', 'overall'}


# Calculating convergence

In [None]:
import numpy as np


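def has_feature(text, feature_set):
    """Check whether any whitespace-separated, lowercased word of `text` is in `feature_set`."""
    # Caveat: capitalized entries (e.g. 'However') and tokenizer pieces such as
    # "n't" can never match here, and words with attached punctuation are missed.
    words = text.lower().split()
    return any(word in feature_set for word in words)

def compute_conv_a_b(a_text, b_text):
    """Per-pair term: [B and A use the feature] - [B uses the feature].

    This is -1 when only B uses the feature and 0 otherwise, so it is not the
    paper's P(B | A) - P(B) computed from separate probability estimates.
    """
    at = has_feature(a_text, liwc_feature_set)
    bt_given_a = has_feature(b_text, liwc_feature_set)
    return int(bt_given_a and at) - int(bt_given_a)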

def overall_conv(initiator_responder_pairs):
    """ Compute the overall Conv(t) over all pairs. """
    conv_values = [compute_conv_a_b(a, b) for a, b in initiator_responder_pairs]
    return np.mean(conv_values)
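
The mean of the per-pair differences above can never be positive, because each term is either 0 or -1. A minimal sketch of an alternative that estimates the two probabilities separately, P(B uses the feature | A uses it) minus P(B uses it), which is one reading of the measure the paper describes. `estimate_convergence` and the toy flags are hypothetical, and pooling all pairs instead of computing the value per speaker pair is an assumption:

```python
def estimate_convergence(flags):
    """Estimate P(B | A) - P(B) from (a_has_feature, b_has_feature) pairs.

    Returns None when there are no pairs or no initiating post contains the
    feature, because the conditional probability is undefined in that case.
    """
    flags = list(flags)
    if not flags:
        return None
    # Replies whose initiating post contains the feature.
    b_given_a = [b for a, b in flags if a]
    if not b_given_a:
        return None
    p_b_given_a = sum(b_given_a) / len(b_given_a)
    p_b = sum(b for _, b in flags) / len(flags)
    return p_b_given_a - p_b

# Toy flags for illustration only: (A uses feature, B uses feature).
toy_flags = [
    (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False),
]
print(estimate_convergence(toy_flags))
```

With the notebook's helpers, the flags could be built as `(has_feature(a, liwc_feature_set), has_feature(b, liwc_feature_set))` for each pair in `initiator_responder_pairs`.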

In [None]:
initiator_responder_pairs = []
for _, row in df.iterrows():
    initiator_text = row['Post']
    replies = row['Replies']
    for _, responder_text in replies:
        initiator_responder_pairs.append((initiator_text, responder_text))


In [None]:
initiator_responder_pairs[0]

('Automation is happening at an astonishing rate, but the only industries that will likely be automated in 20 years is transportation and some types of administration. You also seem to be assuming post scarcity will be an overnight transition. First there will be lots of unemployable people from recently automated fields. Then the government would have to respond to facillitate to a post scarcity economy that hopefully isnt some marxist nightmare. Considering this, your parents may not be able to support you or themselves during this transition. My reccomendation is find a field that is still generations away from automation, and different from any of your familys jobs to diversify, so hopefully you can always have some income in the house.',
 'You are also assuming there will ever be a "post scarcity" period... There is very little data supporting the idea that automation = less job opportunity overall. There is some data that automation advances in a a given industry can reduce job o

# Compute overall Conv(t)

In [None]:
overall_convergence = overall_conv(initiator_responder_pairs)
print(f"Overall Conv(t) for the feature 'negations': {overall_convergence}")

Overall Conv(t) for the feature 'negations': -0.06451612903225806
