<img src="https://support.appsflyer.com/hc/article_attachments/360001968989/twitter_logo.jpg" style='float:right; width:200px; margin: 0 20px;'>

# Twitter Stance Classification
---

> A simple example of running the MAXCUT algorithm for `stance classification` in Twitter conversations


In [2]:
# env 
import sys
sys.path.append('/Users/shaimeital/code/thesis/cmv-stance-classification')
sys.path.append('/Users/shaimeital/code/thesis/conversant')

In [41]:
from collections import Counter
from itertools import combinations
from typing import Sequence, Tuple, Iterable
import numpy as np
import tqdm
import pandas as pd
from stance_classification.reddit_conversation_parser import CMVConversationReader
from stance_classification.utils import iter_trees_from_lines
from conversant.conversation.parse import ConversationParser
from conversant.conversation import NodeData
import networkx as nx

## Load conversation data
We load the conversation data, stored in JSON Lines (JSONL) format, with pandas; the `conversant` package is used further below to parse it into a conversation.

In [4]:
conv_id = "1274782757907030016"
sample = pd.read_json(f'/Users/shaimeital/code/thesis/cmv-stance-classification/data/Twitter Conversation/conv-{conv_id}/tweets.jsonl',
                      lines=True,
                      dtype={'in_reply_to_status_id_str': str, 'conversation_id_str': str, 'text': str, 'id_str': str})
sample.head()

Unnamed: 0,created_at,id_str,text,truncated,entities,source,in_reply_to_status_id_str,in_reply_to_user_id_str,in_reply_to_screen_name,user_id_str,retweet_count,favorite_count,conversation_id_str,lang,is_quote_status,quoted_status_id_str,possibly_sensitive_editable,self_thread,place,extended_entities
0,2020-06-21 21:28:47+00:00,1274816558519418880,"@ylecun Yann, here's a fun experiment: trainin...",1.0,"{'user_mentions': [{'screen_name': 'ylecun', '...","<a href=""http://twitter.com/#!/download/ipad"" ...",1.27478275790703e+18,48008938.0,ylecun,39547749,6,61,1274782757907030016,en,,,,,,
1,2020-06-21 19:47:51+00:00,1274791158913302529,@RandomlyWalking Because people should be awar...,,{'user_mentions': [{'screen_name': 'RandomlyWa...,"<a href=""http://twitter.com/download/android"" ...",1.274783738929406e+18,21815759.0,RandomlyWalking,48008938,0,128,1274782757907030016,en,,,,,,
2,2020-06-20 16:36:36+00:00,1274380641644294150,This image speaks volumes about the dangers of...,,"{'urls': [{'url': 'https://t.co/GsoQqSr3XP', '...","<a href=""https://mobile.twitter.com"" rel=""nofo...",,,,2393183286,618,1942,1274380641644294150,en,1.0,1.274315e+18,1.0,{'id_str': '1274380641644294150'},,
3,2020-06-21 21:09:35+00:00,1274811728296017928,@ylecun Train it on the *WHOLE* American popul...,1.0,"{'user_mentions': [{'screen_name': 'ylecun', '...","<a href=""https://mobile.twitter.com"" rel=""nofo...",1.27478275790703e+18,48008938.0,ylecun,144553035,49,326,1274782757907030016,en,1.0,1.204053e+18,1.0,,,
4,2020-06-21 19:14:28+00:00,1274782757907030016,ML systems are biased when data is biased.\nTh...,1.0,"{'urls': [{'url': 'https://t.co/O7IBsrUB1b', '...","<a href=""http://twitter.com/download/android"" ...",,,,48008938,515,2733,1274782757907030016,en,1.0,1.274381e+18,1.0,,,
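
The `in_reply_to_status_id_str` values above print in scientific notation (for example `1.27478275790703e+18`), which suggests the IDs passed through a float on the way in. The following is a minimal sketch, using only the standard library and made-up records, of why that matters for 64-bit tweet IDs. The helper name `read_jsonl` and the sample records are illustrative assumptions, not part of the original pipeline.

```python
import json


def read_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts."""
    return [json.loads(line) for line in lines if line.strip()]


# Made-up records shaped like the tweets file; field names follow the columns above.
raw_lines = [
    '{"id_str": "1274791158913302529", "in_reply_to_status_id_str": null}',
    '{"id_str": "1274791158913302530", "in_reply_to_status_id_str": "1274791158913302529"}',
]
records = read_jsonl(raw_lines)
parent_id = records[1]["in_reply_to_status_id_str"]

# Kept as a string, the parent ID still matches the parent's id_str exactly.
ids_match_as_str = parent_id == records[0]["id_str"]

# Routed through a float, the ID is rounded, because a float64 cannot
# represent every integer of this magnitude.
survives_float = str(int(float(parent_id))) == parent_id

print(ids_match_as_str, survives_float)
```

If parent IDs are rounded this way, a reply can no longer be matched to its parent by string equality, so it is worth checking how the ID columns were parsed before building the tree.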


In [5]:
cols = ['created_at', 'id_str', 'text', 'user_id_str', 'conversation_id_str', 'in_reply_to_status_id_str', 'in_reply_to_screen_name']

In [6]:
sample = sample.filter(cols)
sample.head()

Unnamed: 0,created_at,id_str,text,user_id_str,conversation_id_str,in_reply_to_status_id_str,in_reply_to_screen_name
0,2020-06-21 21:28:47+00:00,1274816558519418880,"@ylecun Yann, here's a fun experiment: trainin...",39547749,1274782757907030016,1.27478275790703e+18,ylecun
1,2020-06-21 19:47:51+00:00,1274791158913302529,@RandomlyWalking Because people should be awar...,48008938,1274782757907030016,1.274783738929406e+18,RandomlyWalking
2,2020-06-20 16:36:36+00:00,1274380641644294150,This image speaks volumes about the dangers of...,2393183286,1274380641644294150,,
3,2020-06-21 21:09:35+00:00,1274811728296017928,@ylecun Train it on the *WHOLE* American popul...,144553035,1274782757907030016,1.27478275790703e+18,ylecun
4,2020-06-21 19:14:28+00:00,1274782757907030016,ML systems are biased when data is biased.\nTh...,48008938,1274782757907030016,,


Make sure we have only one conversation in our data

In [10]:
assert sample.conversation_id_str.nunique() == 1, 'data has more than one conversation'

In [8]:
# quick fix - more than 1 conversation id
print(sample.count().values[0])
sample = sample.loc[sample.conversation_id_str == conv_id]
print(sample.count().values[0])

264
255


In [9]:
assert sample.loc[sample.in_reply_to_status_id_str == 'nan'].count().values[0] == 1, \
                                                      'more than one root found in conversation'

In [11]:
# quick fix - more than 1 root
print(sample.count().values[0])
sample = sample.loc[sample.id_str != '1203416002626805765']
print(sample.count().values[0])


255
255


In [12]:
assert len(set(sample.in_reply_to_status_id_str).difference(set(sample.id_str))) > 1, 'Some parents are not in the id list'
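
The three sanity checks in this section (a single conversation ID, a single root, and every parent present among the IDs) can be sketched together on plain records, for example those returned by `sample.to_dict('records')`. This is a hedged illustration rather than the notebook's own code: the function name `validate_conversation` is hypothetical, and it assumes a missing parent is stored as `None` instead of the string `'nan'`.

```python
def validate_conversation(records):
    """Return a list of human-readable problems found in one conversation."""
    problems = []

    # Check 1: all records belong to the same conversation.
    conversation_ids = {r["conversation_id_str"] for r in records}
    if len(conversation_ids) != 1:
        problems.append(f"expected 1 conversation id, found {len(conversation_ids)}")

    # Check 2: exactly one record has no parent (the root).
    roots = [r["id_str"] for r in records if r["in_reply_to_status_id_str"] is None]
    if len(roots) != 1:
        problems.append(f"expected 1 root, found {len(roots)}")

    # Check 3: every reply points at a record that is present.
    known_ids = {r["id_str"] for r in records}
    orphans = sorted(
        r["id_str"]
        for r in records
        if r["in_reply_to_status_id_str"] is not None
        and r["in_reply_to_status_id_str"] not in known_ids
    )
    if orphans:
        problems.append(f"replies with a missing parent: {orphans}")

    return problems


# Toy data: a root, one valid reply, and one reply to an unknown parent.
good = [
    {"id_str": "1", "conversation_id_str": "1", "in_reply_to_status_id_str": None},
    {"id_str": "2", "conversation_id_str": "1", "in_reply_to_status_id_str": "1"},
]
broken = good + [
    {"id_str": "3", "conversation_id_str": "1", "in_reply_to_status_id_str": "99"},
]

print(validate_conversation(good))
print(validate_conversation(broken))
```

Collecting the problems in a list, instead of asserting one at a time, shows everything that needs a quick fix in a single pass.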

In [13]:
sample.head()

Unnamed: 0,created_at,id_str,text,user_id_str,conversation_id_str,in_reply_to_status_id_str,in_reply_to_screen_name
0,2020-06-21 21:28:47+00:00,1274816558519418880,"@ylecun Yann, here's a fun experiment: trainin...",39547749,1274782757907030016,1.27478275790703e+18,ylecun
1,2020-06-21 19:47:51+00:00,1274791158913302529,@RandomlyWalking Because people should be awar...,48008938,1274782757907030016,1.274783738929406e+18,RandomlyWalking
3,2020-06-21 21:09:35+00:00,1274811728296017928,@ylecun Train it on the *WHOLE* American popul...,144553035,1274782757907030016,1.27478275790703e+18,ylecun
4,2020-06-21 19:14:28+00:00,1274782757907030016,ML systems are biased when data is biased.\nTh...,48008938,1274782757907030016,,
5,2020-06-21 19:46:20+00:00,1274790777516961792,@ArcusCoTangens Not so much ML researchers but...,48008938,1274782757907030016,1.274784988983169e+18,ArcusCoTangens


Single conversation parsing

In [14]:
class Twitterconversationreader(ConversationParser[pd.DataFrame, pd.Series]):

    def __init__(self):
        super().__init__()


    def extract_node_data(self, raw_node: pd.Series, verbose=0) -> NodeData:
        node_id = raw_node.id_str
        if verbose > 0:
            print(f'parsing node id {node_id}')
        author = raw_node.user_id_str
        timestamp = raw_node.created_at
        data = dict(raw_node)
        parent_id = data.get('in_reply_to_status_id_str')
        if parent_id == 'nan':
            parent_id = None
            print(f'root node is {node_id}')
            
        return NodeData(node_id, author, timestamp, data, parent_id)
        

    def iter_raw_nodes(self, raw_conversation: pd.DataFrame) -> Iterable[pd.Series]:
        # iterrows() yields (index, row) pairs; only the row is needed
        for _, row in raw_conversation.iterrows():
            yield row

In [15]:
twitter_reader = Twitterconversationreader()

In [16]:
conversation = twitter_reader.parse(sample)

root node is 1274782757907030016


In [17]:
len(conversation.participants)

152

In [18]:
len(list(conversation.iter_conversation()))

247

## Create interaction graph from conversation object

In [31]:
from conversant.interactions.reply_interactions_parser import get_reply_interactions_parser

reply_interactions_parser = get_reply_interactions_parser()

interaction_graph = reply_interactions_parser.parse(conversation)
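
To make the idea of a reply interaction graph concrete, here is a minimal standard-library sketch that counts replies between pairs of authors. It is an illustration under stated assumptions, not the `conversant` implementation: the function name `count_reply_interactions` and the `(node_id, author, parent_id)` tuple layout are hypothetical, and the graph is treated as undirected.

```python
from collections import Counter


def count_reply_interactions(nodes):
    """Count replies between author pairs.

    nodes: list of (node_id, author, parent_id) tuples; parent_id is None for the root.
    Returns a Counter keyed by frozenset({author, parent_author}).
    """
    author_of = {node_id: author for node_id, author, _ in nodes}
    replies = Counter()
    for _, author, parent_id in nodes:
        parent_author = author_of.get(parent_id)
        # Skip the root, replies to unknown parents, and self-replies.
        if parent_author is None or parent_author == author:
            continue
        replies[frozenset((author, parent_author))] += 1
    return replies


# Toy conversation: alice posts, bob and carol reply, alice answers bob,
# and carol continues her own thread (a self-reply, which is ignored).
nodes = [
    ("t1", "alice", None),
    ("t2", "bob", "t1"),
    ("t3", "alice", "t2"),
    ("t4", "carol", "t1"),
    ("t5", "carol", "t4"),
]
interactions = count_reply_interactions(nodes)
for pair, n_replies in interactions.items():
    print(sorted(pair), n_replies)
```

Each counted pair corresponds to one weighted edge between two participants, which is the kind of structure a `replies` edge weight would be read from.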


In [20]:
interaction_graph.get_core_interactions(inplace=True)

<conversant.interactions.interactions_graph.InteractionsGraph at 0xa176712d0>


## Apply the MAXCUT algorithm for stance classification in the conversation

In [None]:
interaction_graph.set_interaction_weights(lambda x: x['replies'])

In [None]:
from stance_classification.classifiers.maxcut_stance_classifier import MaxcutStanceClassifier

In [None]:
maxcut = MaxcutStanceClassifier()

In [None]:
maxcut.set_input(interaction_graph.graph)

In [None]:
op = conversation.root.author

In [None]:
maxcut.classify_stance(op)

In [None]:
maxcut.draw()
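
The intuition behind using max-cut here is that replies tend to express disagreement, so splitting the participants into two sides that maximizes the total weight of edges crossing the split should separate opposing stances. The sketch below illustrates this with a brute-force search on a tiny toy graph. It is not how `MaxcutStanceClassifier` works internally (that is not shown in this notebook); the function names, the edge-weight layout, and the "support"/"oppose" labels are assumptions made for illustration, and exhaustive search is only feasible for a handful of nodes.

```python
from itertools import combinations


def brute_force_maxcut(nodes, weights):
    """Exhaustively find the heaviest cut.

    weights: dict mapping frozenset({u, v}) -> edge weight.
    Returns (best_weight, side_a), where side_a always contains nodes[0].
    """
    nodes = list(nodes)
    first, rest = nodes[0], nodes[1:]
    best_weight, best_side = -1, set()
    # Pinning the first node to side A avoids evaluating mirrored cuts twice.
    for size in range(len(rest) + 1):
        for chosen in combinations(rest, size):
            side_a = {first, *chosen}
            # An edge is cut when exactly one of its endpoints is on side A.
            cut_weight = sum(w for edge, w in weights.items() if len(edge & side_a) == 1)
            if cut_weight > best_weight:
                best_weight, best_side = cut_weight, side_a
    return best_weight, best_side


def stances_relative_to(op, nodes, side_a):
    """Label each node by whether it falls on the same side of the cut as op."""
    op_side = op in side_a
    return {n: "support" if (n in side_a) == op_side else "oppose" for n in nodes}


# Toy interaction graph: edge weights are reply counts between authors.
authors = ["op", "a", "b", "c"]
reply_weights = {
    frozenset(("op", "a")): 3,
    frozenset(("op", "b")): 2,
    frozenset(("a", "c")): 2,
    frozenset(("a", "b")): 1,
}
best_weight, side_a = brute_force_maxcut(authors, reply_weights)
stances = stances_relative_to("op", authors, side_a)
print(best_weight, stances)
```

In the toy graph, `a` and `b` both argue with `op`, and `c` argues with `a`, so the heaviest cut places `c` on the same side as `op`.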

## Interaction Graph Feature Engineering

In [43]:
graph = interaction_graph.graph

In [51]:
graph.degree([0])

DegreeView({})

In [85]:
# interactions features

In [83]:
# compute each k-core once and reuse it for both the count and the ratio
two_core = nx.k_core(graph, 2)
three_core = nx.k_core(graph, 3)
features = {'clustering_coeff': nx.average_clustering(graph),
            'average_degree': np.mean([degree for _, degree in graph.degree()]),
            'average_betweeness_centrality': np.mean(list(nx.betweenness_centrality(graph).values())),
            '#two-core_nodes': len(two_core),
            '#three-core_nodes': len(three_core),
            'two-core_to_full_graph_ratio': len(two_core) / len(graph),
            'three-core_to_full_graph_ratio': len(three_core) / len(graph)}

In [84]:
pd.DataFrame(features, index=[0])

Unnamed: 0,clustering_coeff,average_degree,average_betweeness_centrality,#two-core_nodes,#three-core_nodes,two-core_to_full_graph_ratio,three-core_to_full_graph_ratio
0,0.036186,2.263158,0.009737,36,0,0.236842,0.0
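
The k-core features above rely on the k-core of a graph: the largest subgraph in which every node has at least k neighbours inside that subgraph. As a rough illustration of what `nx.k_core` returns, the sketch below repeatedly peels off low-degree nodes in plain Python. It is not networkx's implementation; the function name `k_core_nodes` and the adjacency-dict input are assumptions for this example, and the graph is assumed to be undirected without self-loops.

```python
def k_core_nodes(adjacency, k):
    """Return the node set of the k-core.

    adjacency: dict mapping each node to the set of its neighbours.
    """
    # Work on a copy so the caller's adjacency structure is left untouched.
    remaining = {node: set(neighbours) for node, neighbours in adjacency.items()}
    changed = True
    while changed:
        changed = False
        low_degree = [n for n, neighbours in remaining.items() if len(neighbours) < k]
        for node in low_degree:
            # Removing a node lowers the degree of each of its neighbours,
            # which may push them below k on the next pass.
            for neighbour in remaining.pop(node):
                remaining[neighbour].discard(node)
            changed = True
    return set(remaining)


# Toy graph: a triangle (a, b, c) with a two-node tail (c - d - e).
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
two_core_nodes = k_core_nodes(toy_graph, 2)
three_core_nodes = k_core_nodes(toy_graph, 3)
two_core_ratio = len(two_core_nodes) / len(toy_graph)
print(sorted(two_core_nodes), sorted(three_core_nodes), two_core_ratio)
```

A three-core count of zero, as in the table above, means that no group of participants exists in which everyone interacts with at least three others from the same group.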
