# Language in Social Context: Bridging NLP and Sociolinguistics (ESSLLI 2024)

## Day 3: Analyzing linguistic adaptation

The goal of this exercise is to get familiar with code for calculating linguistic adaptation. We will use data from Tan et al. (2016), "Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions" (https://chenhaot.com/data/cmv/cmv.tar.bz2), which we have pre-filtered for you into a list of conversation turns.

We will look into:

1. Data preparation
2. Annotating data with LIWC categories
3. Calculating style convergence

# 1. Data preparation

## Download the data (or run the script prepare-cmv-data.sh locally)

In [1]:
!gdown --fuzzy "https://drive.google.com/file/d/1lJxnI5r1e13hkOprIgcSfnIEto7kol_K/view?usp=sharing"

Downloading...
From (original): https://drive.google.com/uc?id=1lJxnI5r1e13hkOprIgcSfnIEto7kol_K
From (redirected): https://drive.google.com/uc?id=1lJxnI5r1e13hkOprIgcSfnIEto7kol_K&confirm=t&uuid=2b54f973-82e9-4e28-a8cd-ac49e43cc84c
To: /content/cmv.direct_replies.csv
100% 438M/438M [00:09<00:00, 43.9MB/s]


## Load the Dataset


In [2]:
import pandas as pd
import ast

df = pd.read_csv("/content/cmv.direct_replies.csv", converters={'Replies': ast.literal_eval})
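
The `converters` argument above applies `ast.literal_eval` to every cell of the `Replies` column, so each cell arrives as a Python list of `(author, text)` tuples rather than a string. A minimal sketch of what happens to a single cell, using a made-up cell value (`raw_cell` is hypothetical, not taken from the dataset):

```python
import ast

# Hypothetical cell value, shaped like one entry of the "Replies" column on disk.
raw_cell = "[('bob', 'A reply'), ('carol', \"Don't agree\")]"

# literal_eval only parses Python literals (lists, tuples, strings, numbers, ...);
# unlike eval(), it does not execute arbitrary code.
replies = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__)  # str
print(type(replies).__name__)   # list
for author, text in replies:
    print(f"{author}: {text}")
```

Without the converter, `row["Replies"]` would be a plain string and iterating over it would yield single characters instead of replies.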

In [3]:
df.head()


Unnamed: 0,Author,Topic,Post,Replies
0,recycled_kevlar,jobs_automation_will_robots,Automation is happening at an astonishing rate...,"[(EstoAm, You are also assuming there will eve..."
1,EstoAm,jobs_automation_will_robots,"You are also assuming there will ever be a ""po...","[(recycled_kevlar, Two things: A post scarcity..."
2,recycled_kevlar,jobs_automation_will_robots,Two things: A post scarcity economy is not def...,"[(EstoAm, No the automation you are talking ab..."
3,EstoAm,jobs_automation_will_robots,No the automation you are talking about has ma...,"[(recycled_kevlar, Why? Economic value can not..."
4,recycled_kevlar,jobs_automation_will_robots,Why? Economic value can not be forced onto a t...,"[(EstoAm, Because humans have an innate drive ..."


In [4]:
len(df)

284409

## Show an example of an exchange

In [5]:
row = df.iloc[17]

print(f"Author {row['Author']} posted: {row['Post']}")
print("Replies:")
for reply in row["Replies"]:
    print(f"    Author: {reply[0]}, Post: {reply[1]}")

Author DurianMD posted: So recently I've seen a lot of posts condemning Islam as a violent religion or a sexist religion. I point out that many Christians follow the bible which has numerous examples of sexism, but in application, there are numerous branches of Christianity that are no more sexist than secular groups. For example, Congregationalists and Universaliists. So, my belief is that while religion can inform the views of people, it is far more likely that religion will be used to justify actions that would have been executed any way. I think that most Jewish people don't want to stone adulterers and most Muslims don't want to stone non believers. _____ &gt; *Hello, users of CMV! This is a footnote from your moderators. We'd just like to remind you of a couple of things. Firstly, please remember to* ***[read through our rules](http://www.reddit.com/r/changemyview/wiki/rules)***. *If you see a comment that has broken one, it is more effective to report it than downvote it. Speaki

## Count all exchanges


In [6]:
def number_of_exchanges():
    # Total number of (post, reply) pairs: sum of reply-list lengths over all rows
    return sum(len(replies) for replies in df["Replies"])

print("Number of exchanges:", number_of_exchanges())

Number of exchanges: 409650


## Top 10 topics

In [7]:
df["Topic"].value_counts()[:10]

Topic,count
the_and_to_of,87494
abortion_fetus_prolife_abortions,6110
gun_guns_firearms_weapons,5231
gender_transgender_trans_sex,4721
meat_animals_eating_eat,4536
black_white_racism_racist,3684
rape_victim_victims_raped,3660
feminism_men_feminist_women,3625
marriage_gay_polygamy_marry,3287
moral_morality_morals_objective,2844


## List of Authors

In [8]:
df['Author'].value_counts()

Author,count
[deleted],23874
BenIncognito,1360
Hq3473,1312
GnosticGnome,1308
monkyyy,1242
...,...
selfish,1
Emeryn,1
MerryWalrus,1
joebarts97,1


## Author = [deleted] ?
We need to remove these posts because the `[deleted]` label lumps together many different authors, so it cannot be treated as a single speaker.

In [9]:
# Filter out replies written by deleted authors
def remove_deleted(replies):
    return [reply for reply in replies if reply[0] != "[deleted]"]

# Remove conversation turns whose starting author is [deleted]
# (.copy() avoids pandas' SettingWithCopyWarning on the assignment below)
df = df[df["Author"] != "[deleted]"].copy()

# Remove replies from [deleted]
df['Replies'] = df['Replies'].apply(remove_deleted)

print(number_of_exchanges())

338385


## Top 10 authors based on posts

In [10]:
author_counts = df['Author'].value_counts()

# Get the top 10 authors (only starting conversations)
top_10_authors = author_counts.head(10)

# Display the top 10 authors
print(top_10_authors)

Author
BenIncognito       1360
Hq3473             1312
GnosticGnome       1308
monkyyy            1242
huadpe             1092
Raintee97           978
z3r0shade           965
DHCKris             921
Mavericgamer        900
AnxiousPolitics     843
Name: count, dtype: int64


## Top 10 authors based on replies received

In [11]:
print(list(df.iloc[[17]]['Replies']))

[[('recycled_kevlar', 'Your stance relies on the assumption that religion has no influence on the actions of its followers beyond the superficial. Yet something must exist that allows this pattern to occur. Ill narrow it down to religion or culture. So, you are correct if you assume the culture dominates the religion, and you are incorrect if the reverse is true. With this in mind, I think its safe to assume the truth is somewhere in between, with both the religion and the culture somehow influencing the unrest we see.'), ('beer_demon', "The fact I don't fear an Amish flying a plane into my office tower or a Jain blowing up my bus makes me doubt that religions have no inherent features that influence people's actions.")]]


In [12]:
df['Replies_Count'] = df['Replies'].apply(lambda replies: len(replies) if isinstance(replies, list) else 0)

author_comment_counts = df.groupby('Author')['Replies_Count'].sum()
top_10_authors = author_comment_counts.nlargest(10)

print(top_10_authors)

Author
Hq3473          1730
BenIncognito    1391
GnosticGnome    1357
monkyyy         1238
huadpe          1199
kabukistar      1031
Raintee97       1022
z3r0shade        980
DHCKris          948
Mavericgamer     893
Name: Replies_Count, dtype: int64


# 2. Annotating data with LIWC categories



In [13]:
import numpy as np
from collections import defaultdict
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [14]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Features

In [15]:
# From Section 4, "Measuring linguistic style": the nine LIWC-derived categories
# (Pennebaker et al., 2007) are articles, auxiliary verbs, conjunctions,
# high-frequency adverbs, impersonal pronouns, negations, personal pronouns,
# prepositions, and quantifiers.
# LIWC itself is a word list; here each category is approximated with Penn Treebank
# POS tags, so the mapping is coarse (e.g. 'RB' counts toward both adverbs and negations).

liwc_categories = {
    'articles': ['DT'],  # Determiners (a superset of articles)
    'auxiliary_verbs': ['MD', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],  # Modals plus all verb forms
    'conjunctions': ['CC'],  # Coordinating conjunctions
    'high_freq_adverbs': ['RB', 'RBR', 'RBS'],  # All adverbs
    'impersonal_pronouns': ['PRP$', 'WP$', 'WP'],  # Possessive and wh-pronouns, used as a proxy
    'negations': ['RB'],  # 'not' and 'never' are tagged as adverbs
    'personal_pronouns': ['PRP'],  # Personal pronouns
    'prepositions': ['IN'],  # Prepositions and subordinating conjunctions
    'quantifiers': ['CD', 'PDT']  # Cardinal numbers, predeterminers
}


In [16]:
def extract_liwc_cats(text):
    """
    Return the set of LIWC-style categories that occur at least once in `text`,
    based on the POS tags of its tokens.
    """
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    liwc_cats = set()
    for _word, tag in pos_tags:
        for category, tags in liwc_categories.items():
            if tag in tags:
                liwc_cats.add(category)

    return liwc_cats
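
To see what the tag-based matching does without running the tagger, here is a small sketch on pre-tagged input. The tags are assigned by hand (an assumption; a real tagger may label tokens differently), and `toy_categories` and `categories_from_tags` are illustrative names, not part of the notebook:

```python
# Reduced copy of the tag-to-category mapping, for illustration only.
toy_categories = {
    "articles": ["DT"],
    "high_freq_adverbs": ["RB", "RBR", "RBS"],
    "negations": ["RB"],
    "personal_pronouns": ["PRP"],
}

def categories_from_tags(tagged_tokens, categories):
    """Return the set of categories whose tag list contains any token's tag."""
    found = set()
    for _word, tag in tagged_tokens:
        for category, tags in categories.items():
            if tag in tags:
                found.add(category)
    return found

# Hand-assigned Penn Treebank tags (assumed, not produced by nltk here).
tagged = [("She", "PRP"), ("quickly", "RB"), ("left", "VBD"), ("the", "DT"), ("room", "NN")]

print(sorted(categories_from_tags(tagged, toy_categories)))
# ['articles', 'high_freq_adverbs', 'negations', 'personal_pronouns']
```

Note that "quickly" alone is enough to mark the sentence as containing a negation, because `RB` is shared by both categories. A word-list check (for example, matching "not", "never", "n't") would be one way to separate them.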

# 3. Calculating convergence

Following the [paper](https://aclanthology.org/W11-0609.pdf), we check whether person A’s use of articles in an utterance triggers the use of articles in respondent B’s reply.

**Note** that this differs from asking whether B uses articles more often when talking to A than when talking to other people (people speak differently to different audiences).

Research question: Does each utterance by A trigger an immediate change in B’s behavior?
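
One way to make this question concrete is to compare how often B's replies contain a category when A's preceding utterance contained it with how often B's replies to A contain it overall. The sketch below does this on toy data for a single (A, B) pair. It is an illustration under assumptions, not the notebook's solution: the `convergence` function and the `(post_cats, reply_cats)` input layout are hypothetical, and the exact formulation should be checked against the paper.

```python
def convergence(exchanges, category):
    """
    exchanges: list of (post_cats, reply_cats) pairs of sets for one (A, B) pair.
    Returns P(reply has category | post has category) - P(reply has category),
    or None when either probability is undefined.
    """
    if not exchanges:
        return None
    # Replies to those posts of A that contain the category
    triggered = [reply for post, reply in exchanges if category in post]
    if not triggered:
        return None  # A never uses the category, so the conditional is undefined
    p_reply = sum(category in reply for _post, reply in exchanges) / len(exchanges)
    p_reply_given_post = sum(category in reply for reply in triggered) / len(triggered)
    return p_reply_given_post - p_reply

# Hypothetical annotated exchanges between A and B.
toy_exchanges = [
    ({"articles"}, {"articles"}),
    ({"articles"}, {"articles", "negations"}),
    ({"negations"}, set()),
    (set(), {"articles"}),
]

print(convergence(toy_exchanges, "articles"))   # 0.25
print(convergence(toy_exchanges, "negations"))  # -0.25
```

A positive value means B uses the category more often right after A does than B does on average when replying to A; a negative value means the opposite.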


In [23]:
# count all the frequencies for probabilities (this will take a while...)
reply_freqs = { }
post_cat_freqs = { }
reply_cat_freqs = { }

cache_liwc = { }
def get_liwc_cats(text):
    # memoize: identical texts are tagged only once
    if text in cache_liwc:
        return cache_liwc[text]

    cats = extract_liwc_cats(text)
    cache_liwc[text] = cats
    return cats

# process only the first 5000 rows to keep the runtime manageable
for index, row in df.head(5000).iterrows():
    post_author = row["Author"]
    if post_author not in reply_freqs:
        reply_freqs[post_author] = { }
        post_cat_freqs[post_author] = { }
        reply_cat_freqs[post_author] = { }

    post_cats = get_liwc_cats(row["Post"])
    for cat in post_cats:
        if cat not in post_cat_freqs[post_author]:
            post_cat_freqs[post_author][cat] = 1
        else:
            post_cat_freqs[post_author][cat] += 1

    for (author, reply) in row["Replies"]:
        if author not in reply_freqs[post_author]:
            reply_freqs[post_author][author] = 1
            reply_cat_freqs[post_author][author] = { }
        else:
            reply_freqs[post_author][author] += 1

        reply_cats = get_liwc_cats(reply)
        for cat in reply_cats:
            if cat not in reply_cat_freqs[post_author][author]:
                reply_cat_freqs[post_author][author][cat] = 1
            else:
                reply_cat_freqs[post_author][author][cat] += 1
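
The nested "if key not in dict" bookkeeping above can also be written with `collections.defaultdict` and `collections.Counter`. The following is only a sketch of that alternative on made-up, pre-annotated exchanges; the names `reply_counts`, `reply_cat_counts`, and `toy` are hypothetical and not used elsewhere in the notebook:

```python
from collections import Counter, defaultdict

# reply_counts[post_author][reply_author] -> number of replies
reply_counts = defaultdict(Counter)
# reply_cat_counts[post_author][reply_author] -> Counter over categories
reply_cat_counts = defaultdict(lambda: defaultdict(Counter))

# Hypothetical exchanges: (post author, reply author, categories found in the reply).
toy = [
    ("alice", "bob", {"articles", "negations"}),
    ("alice", "bob", {"articles"}),
    ("alice", "carol", {"prepositions"}),
]

for post_author, reply_author, cats in toy:
    reply_counts[post_author][reply_author] += 1
    # Counter.update adds 1 for each category in the set
    reply_cat_counts[post_author][reply_author].update(cats)

print(reply_counts["alice"]["bob"])                      # 2
print(reply_cat_counts["alice"]["bob"]["articles"])      # 2
print(reply_cat_counts["alice"]["bob"]["quantifiers"])   # 0
```

A `Counter` returns 0 for missing keys, which removes the need for the explicit "category not in" checks when reading the counts back.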

## Calculating convergence

In [30]:
import numpy as np

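def compute_prob_of_cat_reply(a, b, category):
    """Share of `category` among all category occurrences in b's replies to a."""
    if category not in reply_cat_freqs[a][b]:
        return 0

    # note: the denominator sums the counts of all categories, not the number of replies
    all_cats = sum(reply_cat_freqs[a][b].values())
    cat_replies = reply_cat_freqs[a][b][category]

    return cat_replies / all_cats

def compute_prob_of_cat_post(a, category):
    """Share of `category` among all category occurrences in a's posts."""
    if category not in post_cat_freqs[a]:
        return 0

    all_cats = sum(post_cat_freqs[a].values())
    cat_posts = post_cat_freqs[a][category]

    return cat_posts / all_cats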

In [31]:
# let's see some examples:
print(compute_prob_of_cat_reply("recycled_kevlar", "EstoAm", "articles"))
print(compute_prob_of_cat_post("recycled_kevlar", "articles"))

0.1282051282051282
0.1206896551724138


In [32]:
# TODO Based on the equation from the paper, finish the code to get the difference between the two probabilities
