# Dota Chat Analysis

## 1. Setting up the data

Setting up the data starts by loading in all the files required, which in this case are only 'chat.csv' and 'match.csv'. These contain the chat messages of each player, which team they were on and whether team 1 (radiant team) won or lost. However, these tables contain too much unnecessary information, so some cleanup is required first. This is handled by the 'merge_data' method, kept in a seperate function for seperation of code. Here, for each match of which some messages are recorded (49771 out of 50000) is stored in a row. A row consists of 4 values: match_id, radiant_win, chat_t1, chat_t2. radiant_win corresponds to whether team 1 won or lost. Prediction becomes a lot easier if both teams' chat messages are analysed, so both are included in the data.

In [1]:
import pandas as pd
chat_data = pd.read_csv('data/chat.csv')
match_data = pd.read_csv('data/match.csv')

In [2]:
chat_data.sort_values(by=['match_id'])

Unnamed: 0,match_id,key,slot,time,unit
0,0,force it,6,-8,6k Slayer
28,0,wahaha,6,2019,6k Slayer
29,0,hah,1,2027,Monkey
30,0,space boys,6,2030,6k Slayer
31,0,haha,6,2103,6k Slayer
32,0,hah,1,2112,Monkey
33,0,hah,1,2133,Monkey
34,0,we are losing,6,2141,6k Slayer
35,0,hahah,1,2142,Monkey
36,0,alche,1,2143,Monkey


In [45]:
def merge_data(chat, match):
    total_chat_t1 = []
    total_chat_t2 = []
    chat_t1 = []
    chat_t2 = []
    matches = []
    wins = []
    curr_match = chat_data['match_id'][0]
    for index, row in chat_data.iterrows():
        if curr_match != row['match_id']:
            wins.append(match['radiant_win'][curr_match])
            matches.append(curr_match)
            total_chat_t1.append(chat_t1)
            total_chat_t2.append(chat_t2)
            chat_t1 = []
            chat_t2 = []
        if row['slot'] <= 4:
            chat_t1.append(row['key'])
        else:
            chat_t2.append(row['key'])
        curr_match = row['match_id']
    wins.append(match['radiant_win'][curr_match])
    matches.append(curr_match)
    total_chat_t1.append(chat_t1)
    total_chat_t2.append(chat_t2)
    print(len(matches[1:]), len(wins[1:]), len(total_chat_t1[1:]), len(total_chat_t2[1:]))
    data = {'match_id': matches, 'radiant_win': wins, 'chat_t1': total_chat_t1, 'chat_t2': total_chat_t2}
    data = pd.DataFrame(data)
    return data

In [46]:
df = merge_data(chat_data, match_data)

49771 49771 49771 49771


In [47]:
df.head()

Unnamed: 0,match_id,radiant_win,chat_t1,chat_t2
0,0,True,"[space created, hah, mvp ulti, hah, fuck my as...","[force it, ez 500, bye, fate, is cruel, sad sp..."
1,1,False,"[lol, gege, our eshaker afk, 4 v 4, it will be...","[why, he afk, ya, gg, commend the solo supp]"
2,2,False,"[no, not my problem...i guess, XD, we will wai...","[w8 1 min plz, 1 min, ok dude, so kind of you,..."
3,3,False,"[we dno, hi, ty, u2, bait, no luck, just skill...","[how long, dude, we wait him 10 mints, .., Wtf..."
4,4,True,"[gamegood, gg, gfg]","[gg, gg wr, sad wr sad life, ez mid for qop]"


## 2. Transforming the data

Now that we have a nice DataFrame with our required data, we can start turning into a set of feature vectors and word vectors. This is because our model will not be able to handle raw strings without conversion. For this purpose, we first scrap an online web page (netlingo.com) with a lot of different abbreviations. This code was not created by us (except for changing case to lower and *** to uck, but it is quite general and applies here quite nicely. The data is then stored in a json file labeled 'slang.json' for later use. Next, we also import emot, a library that can detect emojis and emoticons (useful in our case) in text. From these two sets of dictionaries, we now iterate through the data we generated and start replacing values. This will make our text more generic and close to english. It will also remove a lot of context, which is not important in short messages (most are no longer than 2 words).

In [104]:
def scrap_netlingo():
    import re
    from bs4 import BeautifulSoup
    import requests, json
    resp = requests.get('http://www.netlingo.com/acronyms.php')
    soup = BeautifulSoup(resp.text, "html.parser")
    slangdict= {}
    key=""
    value=""
    for div in soup.findAll('div', attrs={'class':'list_box3'}):
        for li in div.findAll('li'):
            for a in li.findAll('a'):
                key = a.text
                value = li.text.split(key)[1]
                # fuck is always changed to f***, so uck is used instead
                slangdict[key.lower()]= re.sub('\*\*\*', 'uck', value.lower())
    with open('data/slang.json', 'w') as fid:
        json.dump(slangdict,fid,indent=2)

In [99]:
scrap_netlingo()

In [196]:
with open('data/slang.json') as json_file:  
    slang = json.load(json_file)

In [294]:
def replace_slang(sentence):
    import re
    sentence = str(sentence).lower()
    #replace laughter with lol
    sentence = re.sub(r'\b(a*ha+h[ha]*|o?l+o+l+[ol]*)\b', 'lol', sentence)
    #replace question marks with just one
    sentence = re.sub(r'\?+', ' ?', sentence)
    s = sentence.split()
    new_s = []
    for word in s:
        if word in slang:
            if "it means" in slang[word] and "-or-" in slang[word]:
                new_s.append(slang[word][8:slang[word].index("-or-") - 1])
                continue
            if "it means" in slang[word]:
                new_s.append(slang[word][8:])
                continue
            if "-or-" in slang[word]:
                new_s.append(slang[word][:slang[word].index("-or-") - 1])
                continue
            new_s.append(slang[word])
        else:
            new_s.append(word)
    return new_s

def replace_emoticons(sentence):
    import emot
    #assumes the sentence is split already
    new_s = []
    for word in sentence:
        if type(emot.emoticons(word)) == dict and emot.emoticons(word)['mean'] != []:
            new_s.append(''.join(emot.emoticons(word)['mean']))
        else:
            new_s.append(word)
    return new_s

def replace_special_characters(sentence):
    sentence = replace_slang(sentence)
    sentence = replace_emoticons(sentence)
    return sentence

def replace_special_characters_in_list(l):
    new_l = []
    for sentence in l:
        new_l.append(replace_special_characters(sentence))
    return new_l

In [295]:
df_copy = df.copy()

In [296]:
# print(df_copy['chat_t1'][0])
# print(replace_special_characters_in_list(df_copy['chat_t1'][0]))
df_copy['chat_t1'] = df_copy['chat_t1'].apply(replace_special_characters_in_list)
df_copy['chat_t2'] = df_copy['chat_t2'].apply(replace_special_characters_in_list)

------------------

In [297]:
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
#Use the same random state to ensure no difference in rows selected
train_vectors_t1, test_vectors_t1, train_labels_t1, test_labels_t1 = train_test_split(df_copy['chat_t1'], df_copy['radiant_win'], random_state = 123, test_size=0.1)
train_vectors_t2, test_vectors_t2, train_labels_t2, test_labels_t2 = train_test_split(df_copy['chat_t1'], df_copy['radiant_win'], random_state = 123, test_size=0.1)


In [317]:
training_words = []
for index, row in enumerate(train_vectors_t1):
    for sentence in train_vectors_t1.iloc[index]:
        training_words.append(sentence)
    for sentence in train_vectors_t2.iloc[index]:
        training_words.append(sentence)

In [320]:
model = Word2Vec(training_words, window=3, min_count=1)

  


array([-0.18272698, -0.3214452 , -0.43639842,  0.06477252, -0.05136469,
       -0.39871436, -0.9818467 , -1.5398655 ,  0.2590159 , -0.6979984 ,
       -1.3583335 ,  0.00358243, -0.34315267,  0.24765311,  0.7159464 ,
        0.5078792 , -1.0051621 , -0.54715174,  0.35380465, -0.53530425,
       -1.6304233 ,  0.7083461 ,  0.5425161 ,  1.4555992 , -0.8925635 ,
       -0.13102978, -0.19437468, -0.9377924 , -0.609619  , -0.06792807,
        0.4237146 ,  0.43225282, -0.01834104,  0.2349232 , -0.39721063,
        0.5616004 , -0.41507465,  0.7218995 ,  0.9010574 , -0.8152847 ,
       -0.12789036, -0.74315065,  0.01354121, -0.42839342,  0.07912259,
        0.8706909 , -0.55707514, -0.1153044 , -0.12271321, -0.78538334,
        0.06370596, -0.35504368, -0.12932321,  0.55758536, -0.1491564 ,
        0.79510057,  0.10631128, -0.30056137, -0.50617033,  1.227375  ,
        0.53664625,  0.274932  , -0.35916793,  0.87219876,  0.8842256 ,
       -0.40891075,  1.3360981 ,  0.02331601, -0.565052  , -0.02

In [345]:
model.most_similar('great')

  """Entry point for launching an IPython kernel.


[('decent', 0.830140233039856),
 ('good', 0.8273137807846069),
 ('amazing', 0.7752606868743896),
 ('sick', 0.7691141366958618),
 ('awesome', 0.762021541595459),
 ('perfect', 0.7480865120887756),
 ('terrible', 0.746741771697998),
 ('cool', 0.7368814945220947),
 ('big', 0.7329425811767578),
 ('shitty', 0.7326442003250122)]

## Wow, those are some ridiculously good results. Try some yourself, like 'laughing out loud' or 'amazing'! (also notice 'shitty', maybe because of sarcasm?)

## 3. Setting up the model

First, we will try to make use of SVM. Thanks to my previous experiences with it. it has gained my respect as a good start.

In [414]:
def sent2vec(sentence):
    global model
    s = []
    n_unknowns = 0
    for w in sentence:
        if w == "<UNKNOWN>":
#             s.append(np.random.rand(model.vector_size))
            n_unknowns+= 1
        else:
            s.append(model[w])
    vector = []
    if n_unknowns == len(sentence):
        return [0] * model.vector_size
    for i in range(0, len(s[0])):
        count = 0
        for word in s:
            count+= word[i]
        vector.append(count/(len(s)))
    return vector

In [451]:
def to_vector(v):
    all_vectors = []
    for i in range(0,len(v[0])):
        vector = []
        sum = 0
        for j in range(0, len(v)):
            sum+= v[j][i]
        vector.append(sum / len(v))
        all_vectors.append(vector)
    return all_vectors;

In [415]:
chat_t1_vectors = []
chat_t2_vectors = []
for index, row in train_vectors_t1.iteritems():
    chat_t1_vector = []
    chat_t2_vector = []
    if len(train_vectors_t1[index]) == 0:
        chat_t1_vectors.append([0] * 100)
    else:
        for sentence in train_vectors_t1[index]:
            chat_t1_vector.append(sent2vec(sentence))
        chat_t1_vectors.append(to_vector(chat_t1_vector))
    if len(train_vectors_t2[index]) == 0:
        chat_t2_vectors.append([0] * 100)
    else:
        for sentence in train_vectors_t2[index]:
            chat_t2_vector.append(sent2vec(sentence))
        chat_t2_vectors.append(to_vector(chat_t2_vector))

  # Remove the CWD from sys.path while we load stuff.


In [450]:
def to_vector2(v1, v2):
    all_vectors = []
    for i in range(0,len(v1)):
        vector = []
        for j in range(0,len(v1[i])):
            num1 = 0
            if type(v1[i][j]) == list:
                num1 = v1[i][j][0]
            else:
                num1 = v1[i][j]
            num2 = 0
            if type(v2[i][j]) == list:
                num2 = v2[i][j][0]
            else:
                num2 = v2[i][j]
            vector.append((num1 + num2) / 2)
        all_vectors.append(vector)
    return all_vectors;

In [438]:
chat_vectors = to_vector2(chat_t1_vectors, chat_t2_vectors)

## 4. Training the model

In [442]:
from sklearn import svm
clf = svm.SVC(gamma='scale')

In [447]:
clf.fit(chat_vectors, [1 if win == True else 0 for win in train_labels_t1])

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

## 5. Testing the model

In [453]:
def replace_unknowns(string, model):
    known_words = model.wv.vocab.keys()
    new_string = [];
    for w in string:
        if w not in known_words:
            new_string.append("<UNKNOWN>")
        else:
            new_string.append(w)
    return new_string

In [454]:
chat_t1_vectors_test = []
chat_t2_vectors_test = []
for index, row in test_vectors_t1.iteritems():
    chat_t1_vector_test = []
    chat_t2_vector_test = []
    if len(test_vectors_t1[index]) == 0:
        chat_t1_vectors_test.append([0] * 100)
    else:
        for sentence in test_vectors_t1[index]:
            chat_t1_vector_test.append(sent2vec(replace_unknowns(sentence, model)))
        chat_t1_vectors_test.append(to_vector(chat_t1_vector_test))
    if len(test_vectors_t2[index]) == 0:
        chat_t2_vectors_test.append([0] * 100)
    else:
        for sentence in test_vectors_t2[index]:
            chat_t2_vector_test.append(sent2vec(replace_unknowns(sentence, model)))
        chat_t2_vectors_test.append(to_vector(chat_t2_vector_test))

  # Remove the CWD from sys.path while we load stuff.


In [461]:
chat_vectors_test = to_vector2(chat_t1_vectors_test, chat_t2_vectors_test)

In [462]:
clf.score(chat_vectors_test, [1 if win == True else 0 for win in test_labels_t1])

0.6729610285255122

## Next, let's try logistic regression

In [505]:

from sklearn.linear_model import LogisticRegression
clf2 = LogisticRegression(random_state=0, solver='lbfgs',
                           multi_class='multinomial').fit(chat_vectors, [1 if win == True else 0 for win in train_labels_t1])



In [466]:
result = clf2.predict(chat_vectors_test)
correct = [p == l for p, l in zip(result, test_labels_t1)]

In [467]:
correct.count(True)/len(correct)

0.6504620329449579

Accuracy of 0.65, a bit below SVM's 0.67

## Also maybe XGBoost

In [506]:
import xgboost as xgb
train = xgb.DMatrix(chat_vectors, label=train_labels_t1)
test = xgb.DMatrix(chat_vectors_test, label=test_labels_t1)

In [479]:
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
param['nthread'] = 4
param['eval_metric'] = 'auc'

In [480]:
evallist = [(test, 'eval'), (train, 'train')]

In [484]:
num_round = 100
bst = xgb.train(param, train, num_round, evallist)

[0]	eval-auc:0.636788	train-auc:0.625304
[1]	eval-auc:0.663269	train-auc:0.655561
[2]	eval-auc:0.675408	train-auc:0.667183
[3]	eval-auc:0.68004	train-auc:0.672882
[4]	eval-auc:0.684209	train-auc:0.678041
[5]	eval-auc:0.687914	train-auc:0.682093
[6]	eval-auc:0.688238	train-auc:0.685862
[7]	eval-auc:0.689838	train-auc:0.689198
[8]	eval-auc:0.690777	train-auc:0.691898
[9]	eval-auc:0.69137	train-auc:0.693153
[10]	eval-auc:0.693095	train-auc:0.695434
[11]	eval-auc:0.692787	train-auc:0.697758
[12]	eval-auc:0.691487	train-auc:0.699891
[13]	eval-auc:0.692141	train-auc:0.701603
[14]	eval-auc:0.694489	train-auc:0.703251
[15]	eval-auc:0.693896	train-auc:0.704224
[16]	eval-auc:0.69447	train-auc:0.705511
[17]	eval-auc:0.695058	train-auc:0.707486
[18]	eval-auc:0.695404	train-auc:0.708936
[19]	eval-auc:0.693989	train-auc:0.710566
[20]	eval-auc:0.693473	train-auc:0.71226
[21]	eval-auc:0.693285	train-auc:0.713064
[22]	eval-auc:0.69335	train-auc:0.714173
[23]	eval-auc:0.696076	train-auc:0.715231
[24]	ev

## Training seems to be going quite slowly, but AUC is gradually increasing for the test set