<img src="https://support.appsflyer.com/hc/article_attachments/360001968989/twitter_logo.jpg" style='float:right; width:200px; margin: 0 20px;'>

# Twitter Stance Classification
---

> A simple example of running the MAXCUT algorithm for `stance classification` in Twitter conversations


In [1]:
# env: make the local project checkouts importable
import sys
sys.path.append('/Users/shaimeital/code/thesis/cmv-stance-classification')
sys.path.append('/Users/shaimeital/code/thesis/conversant')

In [2]:
from collections import Counter
from itertools import combinations
from typing import Sequence, Tuple, Iterable
import numpy as np
import tqdm
import pandas as pd
from stance_classification.reddit_conversation_parser import CMVConversationReader
from stance_classification.utils import iter_trees_from_lines
from conversant.conversation.parse import ConversationParser
from conversant.conversation import NodeData

## Load conversation data
We load the conversation data, stored in JSON Lines format, into a pandas DataFrame. The `Conversant` package is used further down to parse it into a conversation tree.

In [4]:
sample = pd.read_json(
    '/Users/shaimeital/code/thesis/cmv-stance-classification/data/Twitter Conversation/conv-1203211859366576128/tweets.jsonl',
    lines=True,
    dtype={'in_reply_to_status_id_str': str},
)
sample.head()

Unnamed: 0,created_at,id_str,text,truncated,entities,source,user_id_str,retweet_count,favorite_count,conversation_id_str,possibly_sensitive_editable,lang,in_reply_to_status_id_str,in_reply_to_user_id_str,in_reply_to_screen_name,place,self_thread,is_quote_status,quoted_status_id_str,extended_entities
0,2019-12-07 07:17:16+00:00,1203211859366576128,"People are biased.\nData is biased, in part be...",1.0,"{'urls': [{'url': 'https://t.co/tP2x8DD8Nr', '...","<a href=""http://www.facebook.com/twitter"" rel=...",48008938,1508,3674,1203211859366576128,1.0,en,,,,,,,,
1,2019-12-07 14:01:40+00:00,1203313629082378240,@tunguz @ylecun This is an interesting read:\n...,1.0,"{'user_mentions': [{'screen_name': 'tunguz', '...","<a href=""https://about.twitter.com/products/tw...",10756512,3,52,1203211859366576128,1.0,en,1.203276649049002e+18,23511270.0,tunguz,,,,,
2,2019-12-07 08:36:45+00:00,1203231861964771328,@ylecun Learning algorithms are designed by pe...,,"{'user_mentions': [{'screen_name': 'ylecun', '...","<a href=""https://mobile.twitter.com"" rel=""nofo...",2883830368,1,13,1203211859366576128,,en,1.203211859366576e+18,48008940.0,ylecun,,,,,
3,2019-12-07 07:23:59+00:00,1203213550900170752,@ylecun But I just hope biased data will not b...,,"{'user_mentions': [{'screen_name': 'ylecun', '...","<a href=""http://twitter.com/download/android"" ...",1309820665,1,40,1203211859366576128,,en,1.203211859366576e+18,48008940.0,ylecun,,,,,
4,2019-12-07 16:30:43+00:00,1203351140324110336,"@FulvioFlamini @ylecun In the best scenarios, ...",1.0,{'user_mentions': [{'screen_name': 'FulvioFlam...,"<a href=""http://twitter.com/download/iphone"" r...",1149835788,3,9,1203211859366576128,,en,1.2032318619647713e+18,2883830000.0,FulvioFlamini,,,,,
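
The float-formatted values in `in_reply_to_status_id_str` above (for example `1.203276649049002e+18`) suggest that these IDs passed through a 64-bit float on the way in, and a float cannot represent every 19-digit tweet ID exactly. Below is a minimal sketch of one way to avoid that, assuming the file stores the `*_id_str` fields as JSON strings: parse each line with the standard `json` module, which leaves strings untouched. The two records and the helper `read_jsonl_records` are hypothetical and only illustrate the idea.

```python
import io
import json

def read_jsonl_records(lines):
    """Parse JSON Lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

# Hypothetical two-tweet file; the IDs are made up for illustration.
fake_file = io.StringIO(
    '{"id_str": "1203211859366576129", "in_reply_to_status_id_str": null}\n'
    '{"id_str": "1203231861964771331", "in_reply_to_status_id_str": "1203211859366576129"}\n'
)
records = read_jsonl_records(fake_file)

root_id = records[0]["id_str"]
reply_parent = records[1]["in_reply_to_status_id_str"]

# String IDs survive unchanged, so the reply still points at its parent.
ids_match = reply_parent == root_id

# The same ID does not survive a round trip through a 64-bit float.
float_round_trip = int(float(root_id))
precision_lost = float_round_trip != int(root_id)

print(ids_match, precision_lost)
```

Passing `records` to `pd.DataFrame(records)` would then give a frame whose ID columns hold Python strings rather than numbers.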


In [5]:
cols = ['created_at', 'id_str', 'text', 'user_id_str', 'conversation_id_str', 'in_reply_to_status_id_str', 'in_reply_to_screen_name']

In [6]:
sample = sample.filter(cols)
sample.head()

Unnamed: 0,created_at,id_str,text,user_id_str,conversation_id_str,in_reply_to_status_id_str,in_reply_to_screen_name
0,2019-12-07 07:17:16+00:00,1203211859366576128,"People are biased.\nData is biased, in part be...",48008938,1203211859366576128,,
1,2019-12-07 14:01:40+00:00,1203313629082378240,@tunguz @ylecun This is an interesting read:\n...,10756512,1203211859366576128,1.203276649049002e+18,tunguz
2,2019-12-07 08:36:45+00:00,1203231861964771328,@ylecun Learning algorithms are designed by pe...,2883830368,1203211859366576128,1.203211859366576e+18,ylecun
3,2019-12-07 07:23:59+00:00,1203213550900170752,@ylecun But I just hope biased data will not b...,1309820665,1203211859366576128,1.203211859366576e+18,ylecun
4,2019-12-07 16:30:43+00:00,1203351140324110336,"@FulvioFlamini @ylecun In the best scenarios, ...",1149835788,1203211859366576128,1.2032318619647713e+18,FulvioFlamini


Make sure the data contains only one conversation: keep the rows whose `conversation_id_str` matches the root tweet, then drop one tweet by its ID.

In [7]:
sample = sample.loc[sample.conversation_id_str == 1203211859366576128]

In [8]:
sample = sample.loc[sample.id_str != 1203416002626805760]

In [13]:
sample.count()

created_at                   146
id_str                       146
text                         146
user_id_str                  146
conversation_id_str          146
in_reply_to_status_id_str    146
in_reply_to_screen_name      145
dtype: int64

## Single conversation parsing

In [9]:
class Twitterconversationreader(ConversationParser[pd.DataFrame, pd.Series]):
    """Parse one Twitter conversation stored as a DataFrame with one tweet per row."""

    def __init__(self):
        super().__init__()

    def extract_node_data(self, raw_node: pd.Series) -> NodeData:
        node_id = raw_node.id_str
        author = raw_node.user_id_str
        timestamp = raw_node.created_at
        data = dict(raw_node)
        parent_id = data.get('in_reply_to_status_id_str')
        # The column was read with dtype=str, so a missing parent shows up as the string 'nan'.
        if parent_id == 'nan':
            parent_id = None
            print(node_id)  # log tweets without a parent (conversation roots)

        return NodeData(node_id, author, timestamp, data, parent_id)

    def iter_raw_nodes(self, raw_conversation: pd.DataFrame) -> Iterable[pd.Series]:
        # iterrows() yields (index, row) pairs; only the row is needed.
        for _, row in raw_conversation.iterrows():
            yield row
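
The `parent_id == 'nan'` check in the reader depends on how the column happened to be read. A slightly more defensive variant, sketched here with a hypothetical helper `normalize_parent_id` (not part of `conversant`), treats `None`, a float NaN, an empty string and the string `'nan'` all as "no parent".

```python
import math
from typing import Any, Optional

def normalize_parent_id(value: Any) -> Optional[str]:
    """Return the parent ID as a string, or None when the tweet has no parent."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    text = str(value).strip()
    if text == "" or text.lower() == "nan":
        return None
    return text

print(normalize_parent_id("nan"))                  # None
print(normalize_parent_id(float("nan")))           # None
print(normalize_parent_id("1203211859366576128"))  # 1203211859366576128
```

Because the helper returns strings, node IDs would need to be strings too for a reply to match its parent.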

In [10]:
twitter_reader = Twitterconversationreader()

In [11]:
conversation = twitter_reader.parse(sample)

1203211859366576128


In [12]:
list(conversation.iter_conversation())

[(0,
  NodeData(node_id=1203211859366576128, author=48008938, timestamp=Timestamp('2019-12-07 07:17:16+0000', tz='UTC'), data={'created_at': Timestamp('2019-12-07 07:17:16+0000', tz='UTC'), 'id_str': 1203211859366576128, 'text': 'People are biased.\nData is biased, in part because people are biased.\nAlgorithms trained on biased data are biased.… https://t.co/tP2x8DD8Nr', 'user_id_str': 48008938, 'conversation_id_str': 1203211859366576128, 'in_reply_to_status_id_str': 'nan', 'in_reply_to_screen_name': nan}, parent_id=None))]
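
The listing above shows a single `(0, NodeData(...))` pair for the root tweet. To illustrate the kind of structure a conversation parser recovers from parent pointers, here is a small sketch, independent of `conversant`, that links hypothetical `(node_id, parent_id)` pairs into a reply tree and walks it depth-first. The function name `iter_depth_first` and the IDs are made up, and reading the first tuple element as the node's depth is an assumption based on the output above.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Optional, Sequence, Tuple

def iter_depth_first(
    pairs: Sequence[Tuple[str, Optional[str]]]
) -> Iterator[Tuple[int, str]]:
    """Yield (depth, node_id) in depth-first order from (node_id, parent_id) pairs."""
    children: Dict[Optional[str], List[str]] = defaultdict(list)
    for node_id, parent_id in pairs:
        children[parent_id].append(node_id)

    # Roots are the nodes without a parent; an explicit stack avoids recursion.
    stack = [(0, root) for root in reversed(children[None])]
    while stack:
        depth, node_id = stack.pop()
        yield depth, node_id
        # Push children in reverse so they are visited in their original order.
        for child in reversed(children.get(node_id, [])):
            stack.append((depth + 1, child))

pairs = [("a", None), ("b", "a"), ("c", "a"), ("d", "b")]
walk = list(iter_depth_first(pairs))
print(walk)
```

In this sketch, a node whose parent ID never appears as a node ID is not reachable from any root, so mismatched ID formats between the two columns would leave replies out of the walk.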