# Quantitative Approaches to Discourse on Social Media
### by Tatjana Scheffler, University of Potsdam (tatjana.scheffler@uni-potsdam.de)

Heidelberg Computational Humanities summer school<br/>
Heidelberg<br/>
July 16, 2019


This notebook assumes that you have a working Python 3 distribution (for example through Anaconda: https://www.continuum.io/downloads). You may have to install some of the packages used below.

## Working with Tweets

First, we import the packages we need.

In [None]:
import simplejson as json
import pandas as pd
import numpy as np
#import nltk

import langid

# display option (don't cut off text in dataframe columns);
# None means "no limit" (older pandas versions used -1 for this)
pd.set_option('display.max_colwidth', None)

We'll be using a thread about hockey as an example here. You can find it at https://bit.ly/2YERjhD (The password should be known to you.)
Here I'm loading the data from the output of a Twarc search of Twitter (a file of json objects):

In [None]:
tweets = []

with open("cangertweets.json") as f:
    for line in f:
        tweet = json.loads(line)
        tweets.append(tweet)
        
print(len(tweets))
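
Files collected over a longer period sometimes contain blank or truncated lines. As a minimal sketch (the helper name `read_json_lines` and the sample data are our own assumptions, not part of Twarc), a more forgiving loader could skip blank lines and remember which lines failed to parse:

```python
import io
import json

def read_json_lines(handle):
    """Parse one JSON object per line, skipping blank lines and noting bad ones."""
    records, bad_lines = [], []
    for lineno, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate empty lines, e.g. a trailing newline
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines.append(lineno)  # remember where parsing failed
    return records, bad_lines

# Made-up stand-in for a file of tweets; real Twarc output has many more fields.
sample = io.StringIO('{"id_str": "1", "full_text": "Hello"}\n\n{"id_str": "2", "full_text"\n')
records, bad_lines = read_json_lines(sample)
print(len(records), bad_lines)
```

With a real file, the same function could be called on the handle returned by `open("cangertweets.json")`.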

In [None]:
# as a dataframe
df = pd.DataFrame(data = tweets)
print(list(df))

df

## 1. Language Identification

Most of the tweets are probably in English, but maybe there are some others?

We can look at Twitter's own language identification. But note that it is notoriously bad (it seems to mostly reflect the *user's* profile language setting). 

In [None]:
df['lang'].describe()

In [None]:
df.loc[df['lang'] != 'en', ['full_text', 'lang']]

`langid` is a Python package that does language identification. It usually works quite well.

In [None]:
langid.classify(df['full_text'][0])

In [None]:
df['langid_lang'] = df.apply(lambda x: langid.classify(x['full_text'])[0], axis=1)
df['langid_lang'].describe()

In [None]:
df.loc[df['lang'] != df['langid_lang'], ['full_text', 'lang', 'langid_lang']]

It seems that `langid` is easily thrown off by small details. In fact, the handle "@KanadaBotschaft" alone makes it change its prediction from English to German:

In [None]:
print(langid.classify('@GermanyDiplo @TeamD @CanadaFP @GermanyInCanada @KanadaBotschaft Hot chocolate for everyone! 😀 '))
print(langid.classify('@GermanyDiplo @TeamD @CanadaFP @GermanyInCanada Hot chocolate for everyone! 😀 '))

One could try and use this info to improve language identification a bit more... (How?)

In [None]:
# some ideas here?
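
One possible idea, as a rough sketch: handles and URLs say little about the language of a tweet, so we could strip them before classification. The helper `strip_handles_and_urls` and its regular expression are our own assumptions; the pattern is a simple approximation, not Twitter's official entity rules.

```python
import re

# Matches @handles and http(s) URLs (a rough approximation).
HANDLE_OR_URL = re.compile(r"@\w+|https?://\S+")

def strip_handles_and_urls(text):
    """Remove handles and URLs, then collapse the leftover whitespace."""
    return " ".join(HANDLE_OR_URL.sub(" ", text).split())

cleaned = strip_handles_and_urls(
    "@GermanyDiplo @TeamD @CanadaFP @GermanyInCanada @KanadaBotschaft Hot chocolate for everyone! 😀 "
)
print(cleaned)
```

The cleaned string could then be passed to `langid.classify` in place of the full tweet text.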

## 2. Tokenizing and Part of Speech Tagging Tweets

### TweetNLP

TweetNLP is a standalone tokenizer and part of speech tagger, which you can run from the command line:
http://www.cs.cmu.edu/~ark/TweetNLP/

It takes a file with only tweet text as input, so we'll have to create one.

In [None]:
# Write one tweet per line; newlines and tabs inside a tweet become spaces
with open("hockeytweets.txt", "w", encoding="utf-8") as outfile:
    for tweet in tweets:
        text = tweet["full_text"].replace('\n', ' ').replace('\t', ' ')
        outfile.write(text + '\n')

# Read the file back in to check the result
with open("hockeytweets.txt", encoding="utf-8") as infile:
    hockeytweets = [line.strip() for line in infile]
print("Number of tweets in hockey thread: " + str(len(hockeytweets)))
print(hockeytweets[15])

Now we can run the tagger: 

`./runTagger.sh input > output`

`./runTagger.sh --no-confidence hockeytweets.txt > hockeytweets-tagged.tsv`

In [None]:
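with open('hockeytweets-tagged.tsv', encoding='utf-8') as infile:
    hockeytweets_tagged = [line.strip() for line in infile]

# quoting=3 (csv.QUOTE_NONE) keeps quotation marks inside tweets from being read as field quotes
tags_df = pd.read_csv('hockeytweets-tagged.tsv', sep='\t', header=None, names=['tokens', 'tags','text'], quoting=3)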

tags_df
#tags_df['tokens']
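
To get a first impression of the tagger output, we could count how often each tag occurs. The following is a small sketch that works on lines of the form `tokens<TAB>tags<TAB>text`, as in the file above. The function name `count_tags` and the two sample lines are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

def count_tags(tagged_lines):
    """Count tags over lines of the form 'tokens<TAB>tags<TAB>text'."""
    counts = Counter()
    for line in tagged_lines:
        tokens, tags, _text = line.split("\t")
        tok_list, tag_list = tokens.split(" "), tags.split(" ")
        if len(tok_list) != len(tag_list):
            continue  # skip lines where tokens and tags do not line up
        counts.update(tag_list)
    return counts

# Invented example lines; '@' marks a mention and 'U' a URL.
sample_lines = [
    "@user Hot chocolate !\t@ A N ,\t@user Hot chocolate!",
    "@user see https://t.co/x\t@ V U\t@user see https://t.co/x",
]
tag_counts = count_tags(sample_lines)
print(tag_counts.most_common())
```

On the real data, `count_tags(hockeytweets_tagged)` should give the tag distribution for the whole thread, provided every line has exactly three tab-separated fields.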

### Some other ways to look at tokenization / tagging

This section is just for future reference; we'll skip ahead to the next part.

In [None]:
# (DR)
# Look at tokens and tags

tokens, tags, text = hockeytweets_tagged[98].split('\t')
print(tokens)
print(tags)

ttoks = tokens.split(" ")
ttags = tags.split(" ")
# pair each token with its tag (tokens first, then tags)
for tok, tag in zip(ttoks, ttags):
#    print(tok + "\t" + tag + "\n") # uncomment to see token - tag pairs
    pass


There is a Python port of the Twitter tokenizer from TweetNLP which can be downloaded from here: https://github.com/myleott/ark-twokenize-py

It allows us to use the tokenizer right in Python.

In [None]:
from twokenize import *

tokenizeRawTweetText("@GermanyDiplo @TeamD @CanadaFP @GermanyInCanada @KanadaBotschaft Canada did, they beat the German woman's soccer team in the RIO Olympics.")

# Tokenize the 'full_text' column of our original dataframe and save the result into a new column 'text_tokens_1'
# df['text_tokens_1'] = df.apply(lambda x: tokenizeRawTweetText(x['full_text']), axis=1)
# df['text_tokens_1'][0]

### SoMaJo / SoMeWeTa

For German, you can use the tokenizer SoMaJo (https://github.com/tsproisl/SoMaJo) and the POS tagger SoMeWeTa (https://github.com/tsproisl/SoMeWeTa). Both can be used as Python packages.

In [None]:
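# Note: this cell was written for the SoMaJo 1.x interface;
# newer SoMaJo releases may require a different API
from somajo import Tokenizer

tok = Tokenizer(split_camel_case=True, token_classes=True, extra_info=True)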

tokenized_tweets = []

for tweet in hockeytweets_tagged:
    tokens, tags, text = tweet.split('\t')
    tokenized = tok.tokenize(text)
    tokenized_tweets.append(tokenized)
    
print (tokenized_tweets[1977])

### Using POS tags

In [None]:
df['tokens'] = tags_df['tokens']
df['tags'] = tags_df['tags']
df.head()[['full_text','tokens','tags','langid_lang']]

In [None]:
# clean text

def remove_handles(tokens, tags):
    """Drop all tokens tagged as user handles ('@') or URLs ('U')."""
    tok_list = tokens.split(' ')
    tag_list = tags.split(' ')
    res = []
    for tok, tag in zip(tok_list, tag_list):
        if tag not in ('@', 'U'):
            res.append(tok)
    return ' '.join(res)

remove_handles(df['tokens'][0],df['tags'][0])

df['text_only'] = df.apply(lambda x: remove_handles(x['tokens'],x['tags']), axis=1)
#df['text_only']

def normalize_text(tokens, tags):
    tok_list = tokens.split(' ')
    tag_list = tags.split(' ')
    res = []
    for tok, tag in zip(tok_list, tag_list):
        if tag == '@': 
            res.append('%USER%')
        elif tag == 'U':
            res.append('%URL%')
        else:
            res.append(tok)
    return(' '.join(res))

df['normalized_text'] = df.apply(lambda x: normalize_text(x['tokens'],x['tags']), axis=1)
#df['normalized_text']

Now, we can recompute the language identification based only on the text: 

In [None]:
df['lang_new'] = df.apply(lambda x: langid.classify(x['text_only'])[0], axis=1)
df['lang_new'].describe()
#df.loc[~(df['lang_new'] == df['langid_lang'])][['full_text','lang_new','langid_lang']]

Language identification on very short tweets is difficult. For tweets below a certain minimum length (in characters), one may want to rely on Twitter's own classification instead.
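
As a minimal sketch of such a fallback rule: the function name `choose_language` and the threshold of 20 characters are arbitrary assumptions for illustration. The sketch also assumes that Twitter marks undetermined languages as `und`.

```python
def choose_language(text, twitter_lang, langid_lang, min_chars=20):
    """Prefer Twitter's label for short texts and langid's label otherwise."""
    is_short = len(text.strip()) < min_chars
    if is_short and twitter_lang and twitter_lang != "und":
        return twitter_lang
    return langid_lang

# Made-up examples
print(choose_language("Yes!", "en", "de"))
print(choose_language("Hot chocolate for everyone in the stadium!", "en", "de"))
```

In the dataframe above, this could be applied row by row, for example with `df.apply(lambda x: choose_language(x['text_only'], x['lang'], x['lang_new']), axis=1)`.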

## Conversation structure

We can visualize the entire conversation with the browser extension [Treeverse](https://github.com/paulgb/Treeverse).

Let's build a discussion tree in Python connecting all the tweets to their replies (and vice versa).

In [None]:
from twitterconversations import *

discussion_threads, replies_dict = make_discussions(tweets)

print_discussions(discussion_threads)
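
The `twitterconversations` module is not shown here, so the following is only a sketch of how such a reply mapping could be built, not that module's actual implementation. It assumes that each tweet carries the fields `id_str` and `in_reply_to_status_id_str`, as in Twitter's tweet objects. The function names and the toy tweets are made up.

```python
from collections import defaultdict

def build_replies(tweets):
    """Map each tweet id to the ids of its direct replies within the collection."""
    known_ids = {t["id_str"] for t in tweets}
    replies = defaultdict(list)
    for t in tweets:
        parent = t.get("in_reply_to_status_id_str")
        if parent in known_ids:  # ignore replies to tweets we did not collect
            replies[parent].append(t["id_str"])
    return replies

def find_roots(tweets):
    """Return ids of tweets whose parent is missing from the collection."""
    known_ids = {t["id_str"] for t in tweets}
    return [t["id_str"] for t in tweets
            if t.get("in_reply_to_status_id_str") not in known_ids]

# Toy data: tweet 5 replies to a tweet (99) that is not in the collection.
toy_tweets = [
    {"id_str": "1", "in_reply_to_status_id_str": None},
    {"id_str": "2", "in_reply_to_status_id_str": "1"},
    {"id_str": "3", "in_reply_to_status_id_str": "1"},
    {"id_str": "4", "in_reply_to_status_id_str": "2"},
    {"id_str": "5", "in_reply_to_status_id_str": "99"},
]
toy_replies = build_replies(toy_tweets)
toy_roots = find_roots(toy_tweets)
print(dict(toy_replies), toy_roots)
```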

In [None]:
# Use the tweet id as the dataframe index
df.set_index('id_str',inplace=True)

In [None]:
# Add a list of direct replies to each tweet (empty if there are none).
# This assumes that replies_dict maps a tweet id to a list of reply ids.
df['replies'] = [list(replies_dict.get(idx, [])) for idx in df.index]
                

## Questions

Now we can for example find all tweets that contain questions and their answers.

In [None]:
questions = df[df['full_text'].str.contains('?', regex=False)]
questions[['full_text','replies']]

In [None]:
# How many questions have answers?
print("Questions with answers: " + str(len(questions[questions['replies'].map(len) > 0])))

q_a_pairs = []
for idx, q in questions.iterrows():
    if q['replies']:
        q_text = df.loc[idx,'full_text']
        for a in q['replies']:
            a_text = df.loc[a,'full_text']
            q_a_pairs.append((q_text,a_text))
            
print("Question-answer pairs: " + str(len(q_a_pairs)))
q_a_pairs

## Metadata

* Users
* Hashtags


In [None]:
df['user_id_str'] = df.apply(lambda x: x['user']['id_str'], axis=1)
df['user_id_str']

## Visualization



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# number of replies
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
# left panel: all tweets; right panel: questions only
g = sns.countplot(x=df['replies'].map(len))
g.set_yscale('log')
plt.subplot(1, 2, 2)
g = sns.countplot(x=questions['replies'].map(len))
g.set_yscale('log')
plt.tight_layout()
plt.show()

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
# same comparison, but only asking whether a tweet has any reply at all
g = sns.countplot(x=df['replies'].map(len) > 0)
g.set_yscale('log')
plt.subplot(1, 2, 2)
g = sns.countplot(x=questions['replies'].map(len) > 0)
g.set_yscale('log')
plt.tight_layout()
plt.show()

How often do users contribute in the thread?

In [None]:
g = sns.countplot(x=df['user_id_str'].value_counts())
g.set_yscale('log')
plt.xlabel("times in dataset") 
plt.ylabel("number of users")
plt.show()
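
The plot above is a "count of counts": first we count the tweets per user, then we count how many users share each tweet count. A minimal sketch of the same computation with `collections.Counter`, using made-up user ids:

```python
from collections import Counter

# Made-up user ids, one entry per tweet
user_ids = ["a", "b", "a", "c", "a", "b", "d"]

tweets_per_user = Counter(user_ids)                   # tweets written by each user
users_per_count = Counter(tweets_per_user.values())   # users sharing each tweet count

for n_tweets in sorted(users_per_count):
    print(n_tweets, users_per_count[n_tweets])
```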

## Emoji

Find all tweets with emoji.

In [None]:
import emoji
from collections import defaultdict

emojitweets = defaultdict(list)

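def find_emoji(text):
    """Record `text` once under each distinct emoji character it contains."""
    emojis_found = set()
    for c in text:
        # emoji.is_emoji() requires a recent version of the emoji package;
        # older versions provided the emoji.UNICODE_EMOJI dictionary instead
        if emoji.is_emoji(c) and c not in emojis_found:
            emojitweets[c].append(text)
            emojis_found.add(c)

for idx, text in df['text_only'].items():
    find_emoji(text)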

emojitweets

In [None]:
plt.figure(figsize=(25, 4))
plt.bar(list(emojitweets.keys()), [len(x) for x in emojitweets.values()])
plt.show()

In [None]:
# use a loop variable that does not shadow the imported emoji module
for emo in sorted(emojitweets, key=lambda x: len(emojitweets[x]), reverse=True):
    print(emo, len(emojitweets[emo]))

## Open questions

* How are the emojis distributed across levels of discussion? E.g., if there is one emoji in a higher-up message, is it more likely that more emoji will follow?
* How are links and hashtags distributed in different branches?
* Are there any users who contribute to different sub-branches of the tree?
* Does the number of followers a user has influence the probability that their post or question is answered?

(For all of these, one should probably exclude the root tweet from consideration.)
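
As a starting point for the question about users in different sub-branches, here is a small sketch. It assumes a mapping `replies` from tweet id to the ids of its direct replies and a mapping `author_of` from tweet id to user id. Both names, the function name, and the toy data are hypothetical. The root tweet itself is excluded, and each direct reply to the root counts as one branch.

```python
from collections import defaultdict

def users_in_multiple_branches(root_id, replies, author_of):
    """Return users who posted in more than one top-level branch below the root."""
    branches_by_user = defaultdict(set)
    for branch_id in replies.get(root_id, []):
        stack = [branch_id]  # depth-first walk through this branch
        while stack:
            tweet_id = stack.pop()
            branches_by_user[author_of[tweet_id]].add(branch_id)
            stack.extend(replies.get(tweet_id, []))
    return {user: branches for user, branches in branches_by_user.items()
            if len(branches) > 1}

# Toy thread: tweet 1 is the root, tweets 2 and 3 each start a branch.
toy_replies = {"1": ["2", "3"], "2": ["4"], "3": ["5"]}
toy_authors = {"1": "root_user", "2": "alice", "3": "bob", "4": "bob", "5": "carol"}
multi = users_in_multiple_branches("1", toy_replies, toy_authors)
print(multi)
```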