# Data Layout

This dataset collects the chat logs of 2,162 Twitch streaming videos by 52 streamers. The videos were streamed between 2018-04-24 and 2018-06-24.

## Description of columns
### body: Actual text of the user's chat message
### channel_id: Channel identifier (integer)
### commenter_id: User identifier (integer)
### commenter_type: User type (character)
### created_at: Time the chat was entered (ISO 8601 date and time)
### fragments: Chat text with parsing information for Twitch emotes (JSON list)
### offset: Time offset, in seconds, between the start of the video stream and the time the chat was entered (float)
### updated_at: Time the chat was edited (ISO 8601 date and time)
### video_id: Video identifier (integer)
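
The relationship between `created_at` and `offset` can be checked directly: subtracting `offset` from `created_at` should give the same stream start time for every chat in one video. Below is a minimal sketch using only the standard library, with three rows copied from the xchocobars sample shown later in this notebook. The helper name `stream_start` is hypothetical, and reading `offset` as seconds is an assumption that these rows happen to support.

```python
from datetime import datetime, timedelta, timezone

def stream_start(created_at: str, offset: float) -> datetime:
    """Estimate the stream start time from one chat row (hypothetical helper)."""
    # created_at looks like '2018-06-15T22:20:35.628Z'; %f accepts 1 to 6 fractional digits.
    sent = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%S.%fZ")
    sent = sent.replace(tzinfo=timezone.utc)
    # Assumption: offset is measured in seconds from the start of the video.
    return sent - timedelta(seconds=offset)

# (created_at, offset) pairs taken from rows 0, 1 and 8 of the sample output.
rows = [
    ("2018-06-15T22:20:35.628Z", 2.928),
    ("2018-06-15T22:20:35.698Z", 2.998),
    ("2018-06-15T22:20:40.14Z", 7.440),
]
starts = [stream_start(created_at, offset) for created_at, offset in rows]
for start in starts:
    print(start.isoformat())
```

All three rows resolve to the same start time, which suggests `offset` can stand in for `created_at` when only the position within a video matters.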

#### The file name indicates the name of the Twitch stream channel.
#### Each file is a pandas.DataFrame (Python 3) saved in Python pickle format and can be loaded like this:
#### `import pandas as pd`
#### `pd.read_pickle('Twitch_data/ICWSM19_data/ninja.pkl')`

Data courtesy of https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VE0IVQ

#### Think about this:
# What is the purpose of this? Clean spam on the fly? Recap chat? Something else?
# Todo (codewise):
### - Pull in data using pickle
### - Identify relevant chat (either selecting best chat each minute or best chats overall)
### - Trim out irrelevant chat
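
One way to approach the second item, picking the best chat each minute, is to bucket messages by `offset // 60` and keep the lowest-scoring message in each bucket. The sketch below assumes each row is a dict with `body` and `offset` keys and that a lower score means a more relevant message. The names `best_chat_per_minute` and `upper_share` are hypothetical, and `upper_share` is only a toy stand-in for a real spam score.

```python
from typing import Callable, Dict, Iterable

def best_chat_per_minute(rows: Iterable[dict],
                         score: Callable[[dict], float]) -> Dict[int, dict]:
    """Keep the lowest-scoring row for each minute of the stream (illustrative)."""
    best: Dict[int, dict] = {}
    for row in rows:
        minute = int(row["offset"] // 60)  # assumes offset is in seconds
        # Strict '<' keeps the earliest row when two scores tie.
        if minute not in best or score(row) < score(best[minute]):
            best[minute] = row
    return best

def upper_share(row: dict) -> float:
    """Toy score: the share of letters in the body that are upper case."""
    letters = [c for c in row["body"] if c.isalpha()]
    return sum(c.isupper() for c in letters) / max(1, len(letters))

# Illustrative rows: bodies come from the sample output, the last offset is made up.
rows = [
    {"body": "POGGERS LIVE", "offset": 3.982},
    {"body": "solo squading?", "offset": 7.158},
    {"body": "LIVE POGGERS", "offset": 61.5},
]
picked = best_chat_per_minute(rows, upper_share)
for minute, row in sorted(picked.items()):
    print(minute, row["body"])
```

Selecting the best chats overall, the other option in the list, would instead sort all rows by score and keep the first few.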


In [2]:
import pandas as pd
import re
#ninja_chat = pd.read_pickle('Twitch_data/ICWSM19_data/ninja.pkl')
chocobars_chat = pd.read_pickle('Twitch_data/ICWSM19_data/xchocobars.pkl')
#xqc_chat = pd.read_pickle('Twitch_data/ICWSM19_data/xqcow.pkl')
#tyler1_chat = pd.read_pickle('Twitch_data/ICWSM19_data/loltyler1.pkl')

In [3]:
chocobars_chat

| | body | channel_id | commenter_id | commenter_type | created_at | fragments | offset | updated_at | video_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | VoHiYo | 42583390 | 21074622 | user | 2018-06-15T22:20:35.628Z | [{'emoticon_id': '81274'}] | 2.928 | 2018-06-15T22:20:35.628Z | 273695352 |
| 1 | hey janet | 42583390 | 128772138 | user | 2018-06-15T22:20:35.698Z | [{'text': 'hey janet'}] | 2.998 | 2018-06-15T22:20:35.698Z | 273695352 |
| 2 | Make sure to check out Janet's newest video! Y... | 42583390 | 1564983 | user | 2018-06-15T22:20:36.003Z | [{'text': 'Make sure to check out Janet's newe... | 3.303 | 2018-06-15T22:20:36.003Z | 273695352 |
| 3 | POGGERS LIVE | 42583390 | 30560738 | user | 2018-06-15T22:20:36.682Z | [{'text': 'POGGERS LIVE'}] | 3.982 | 2018-06-15T22:20:36.682Z | 273695352 |
| 4 | LIVE POGGERS | 42583390 | 30560738 | user | 2018-06-15T22:20:38.942Z | [{'text': 'LIVE POGGERS'}] | 6.242 | 2018-06-15T22:20:38.942Z | 273695352 |
| 5 | LIVE POGGERS | 42583390 | 36677871 | user | 2018-06-15T22:20:39.285Z | [{'text': 'LIVE POGGERS'}] | 6.585 | 2018-06-15T22:20:39.285Z | 273695352 |
| 6 | solo squading? | 42583390 | 58534036 | user | 2018-06-15T22:20:39.858Z | [{'text': 'solo squading?'}] | 7.158 | 2018-06-15T22:20:39.858Z | 273695352 |
| 7 | chocoWave | 42583390 | 38404293 | user | 2018-06-15T22:20:40.119Z | [{'emoticon_id': '17662'}] | 7.419 | 2018-06-15T22:20:40.119Z | 273695352 |
| 8 | LIVE POGGERS | 42583390 | 44498776 | user | 2018-06-15T22:20:40.14Z | [{'text': 'LIVE POGGERS'}] | 7.440 | 2018-06-15T22:20:40.14Z | 273695352 |
| 9 | chcooWave | 42583390 | 94504639 | user | 2018-06-15T22:20:41.251Z | [{'text': 'chcooWave'}] | 8.551 | 2018-06-15T22:20:41.251Z | 273695352 |


### Identify chat as spam or not

In [4]:
#Returns a value between 0 and 1, where 1 indicates a high likelihood that the chat is spam.
#The score is the emote ratio plus the caps ratio, capped at 1.
def spam_score(message):
    message_fragments = message['fragments']
    return min(emote_ratio(message_fragments) + caps_ratio(message_fragments), 1)

In [5]:
#Returns the share of counted characters that are capital letters or one of , . ! ?
#(in text fragments only). Counted characters are A-Z, a-z and , . ! ?;
#digits, spaces and other symbols are ignored.
def caps_ratio(message_fragments):
    lowercase_count = 0
    uppercase_special_count = 0
    for fragment in message_fragments:
        #A fragment is a dictionary of 1 key value pair
        if "text" in fragment:
            text = fragment["text"]
            #Capital letters and the punctuation marks , . ! ? are counted together
            uppercase_special_count += len(re.findall(r"[A-Z,.!?]", text))
            lowercase_count += len(re.findall(r"[a-z]", text))
    #max(1, ...) avoids division by zero for messages with no counted characters
    return uppercase_special_count / max(1, lowercase_count + uppercase_special_count)
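
To make the ratio concrete, the same character classes can be applied to a few messages from the sample output. This sketch restates the counting on plain strings rather than fragment lists. `caps_share` is a hypothetical name used only here.

```python
import re

def caps_share(text: str) -> float:
    """Share of counted characters that are upper case or one of , . ! ? (illustrative)."""
    upper = len(re.findall(r"[A-Z]", text))
    lower = len(re.findall(r"[a-z]", text))
    punct = len(re.findall(r"[,.!?]", text))
    # Digits, spaces and other symbols are not counted at all.
    return (upper + punct) / max(1, lower + upper + punct)

samples = [
    "hey janet",
    "POGGERS LIVE",
    "solo squading?",
    "cheer1000 Bring home a W thanks.",
]
for text in samples:
    print(f"{text!r}: {caps_share(text):.3f}")
```

Because digits and spaces are ignored, `cheer1000 Bring home a W thanks.` scores 3/23 (two capitals and one period against twenty lowercase letters), which is below the 0.25 threshold used later in the notebook.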

In [20]:
#Returns 0 if the message has no emotes; otherwise returns
#1 / (number of non-blank text fragments + 1).
#"PepeHands" is counted as an emote when it appears as a text fragment.
def emote_ratio(message_fragments):
    emote_count = 0
    non_emote_count = 0
    for fragment in message_fragments:
        #A fragment is a dictionary of 1 key value pair
        text = fragment.get("text")  #None for emote fragments, so no KeyError
        if "emoticon_id" in fragment or text == "PepeHands":
            emote_count += 1
        elif text is not None and text != " ":
            #Single-space fragments are skipped; other text counts as non-emote
            non_emote_count += 1
    return (emote_count / (non_emote_count + 1)) / max(emote_count, 1)
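
Writing $e$ for `emote_count` and $n$ for `non_emote_count`, the return expression simplifies as follows. This is a worked simplification of the code as written, not a change to it.

$$
\frac{e/(n+1)}{\max(e,\,1)} =
\begin{cases}
0 & \text{if } e = 0,\\[4pt]
\dfrac{1}{n+1} & \text{if } e \ge 1.
\end{cases}
$$

So the number of emotes does not matter once at least one is present. A message that is a single emote scores 1, and one emote alongside one text fragment scores 0.5.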

# Potential Use Case 1: Identify chatters who are constantly spamming

In [21]:
#avg_spam_score_dict = {}
#sum_spam_score = 0
#num_messages = 0
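#Split the first rows of the chat log into spam and non-spam using a 0.25 threshold
non_spam_messages = []
spam_messages = []
non_spam = 0
for i, (index, row) in enumerate(chocobars_chat.iterrows()):
    if spam_score(row) < 0.25:
        non_spam += 1
        non_spam_messages.append(row['body'])
    else:
        spam_messages.append(row['body'])
    #Note: the loop stops after i == 1001, so 1,002 rows are scored, while the second
    #printed number assumes 1,000 rows. len(spam_messages) gives the exact spam count.
    if i > 1000:
        print(non_spam, 1000-non_spam)
        break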
    #if row['commenter_id'] == "50192778":
    #    print(row['body'])
    #    sum_spam_score += spam_score(row)
    #    num_messages += 1
#avg_spam_score = sum_spam_score / num_messages
#for i in range(100000, 100050):
#    print(spam_score(chocobars_chat.iloc(0)[i]))
#    print("\n")

365 635
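
The heading above names per-chatter detection, which the cell only hints at in its commented-out lines. One possible approach is to average the score per `commenter_id` and look at chatters with a high average. The sketch below assumes rows are dicts with `commenter_id` and `body` keys. The names `average_score_by_chatter`, `upper_share` and `min_messages` are hypothetical, and `upper_share` is a toy stand-in for `spam_score`.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable

def average_score_by_chatter(rows: Iterable[dict],
                             score: Callable[[dict], float],
                             min_messages: int = 2) -> Dict[int, float]:
    """Average score per commenter_id, skipping chatters with too few messages."""
    totals: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for row in rows:
        totals[row["commenter_id"]] += score(row)
        counts[row["commenter_id"]] += 1
    return {
        chatter: totals[chatter] / counts[chatter]
        for chatter in totals
        if counts[chatter] >= min_messages
    }

def upper_share(row: dict) -> float:
    """Toy score: the share of letters in the body that are upper case."""
    letters = [c for c in row["body"] if c.isalpha()]
    return sum(c.isupper() for c in letters) / max(1, len(letters))

# Rows 1, 3, 4 and 6 of the sample output, reduced to the two columns used here.
rows = [
    {"commenter_id": 128772138, "body": "hey janet"},
    {"commenter_id": 30560738, "body": "POGGERS LIVE"},
    {"commenter_id": 30560738, "body": "LIVE POGGERS"},
    {"commenter_id": 58534036, "body": "solo squading?"},
]
averages = average_score_by_chatter(rows, upper_share)
print(averages)
```

The `min_messages` cutoff is there so that a chatter is not labeled from a single message. Rows produced by `chocobars_chat.iterrows()` support the same `row["..."]` indexing, so the function could be fed from the loop above with `spam_score` as the scoring function.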


In [22]:
spam_messages

['VoHiYo',
 "Make sure to check out Janet's newest video! Youtube.com/watch?v=h1ce_NgBVyY - [PART 8] THE LAST SV VIDEO | XCHOCOBARS STARDEW VALLEY",
 'POGGERS LIVE',
 'LIVE POGGERS',
 'LIVE POGGERS',
 'chocoWave',
 'LIVE POGGERS',
 'live POGGERS',
 'POGGERS',
 'POGGERS',
 'COGGERS',
 'chocoHug',
 'LIVE POGGERS',
 'LIVE POGGERS',
 'chocoWave chocoH chocoWave chocoH chocoWave chocoH chocoWave chocoH chocoWave chocoH chocoWave chocoH chocoWave chocoH chocoWave chocoH chocoWave',
 'chocoWave chocoWave chocoWave chocoWave',
 'chocoWave',
 'POGGERS',
 'chocoWave chocoWave chocoWave chocoWave',
 'POGGERS',
 'POGGERS',
 'POGGERS',
 'LIVE POGGERS',
 'LIVE POGGERS',
 'chocoWave chocoH',
 '@xChocoBars chocoWave HI~',
 'LIVE POGGERS',
 'chocoWave',
 'chocoWave chocoWave chocoWave chocoWave chocoWave chocoWave',
 'chocoWave chocoH',
 'chocoWave hi janet',
 'hi janet chocoWave chocoH',
 'POGGERS',
 'chocoWave',
 'training arc POGGERS',
 'chocoH chocoH chocoWave',
 'POGGERS',
 'chocoWave chocoWave ch

In [23]:
non_spam_messages

['hey janet',
 'solo squading?',
 'chcooWave',
 'hi',
 'What’s goodie',
 'Hellooooo',
 'your top looks like an Arizona Ice Tea can',
 'i thought was Jebait',
 'Wanna play duos',
 'HIii janet',
 'You look pretty today Janet',
 'looking good',
 'hi hi hiiii',
 'You and jerry should do the 2v2 20k tourney next time!',
 'cheer1000 Bring home a W thanks.',
 '!prime',
 'If you are a Twitch Prime user, did you know you get a FREE subscription that you can use on Janet? Want to learn how to get Twitch Prime and what else you get along with it? Click here to find out: https://streamable.com/2j1qw',
 'hi',
 'monkaS',
 'football Pog',
 'thats adorable',
 'ddu du ddu du',
 'stitches81 Pog',
 "cheer1000 I'll settle for a few K's instead of a W.",
 'we r family?aAwwwww',
 'pog',
 'chocoWave',
 'pregnant with toast probably',
 'wth',
 '!newvid',
 'wtf',
 'the fuck is wrong with some of you',
 'jesus wtf',
 'wtf chat',
 'wtf chat',
 'stop',
 'Yikes',
 'wtf chat',
 'oof',
 '!social',
 '♡ https://www.tw