# Burry

Burry is a project that analyses text data, in the form of tweets, to find relations between those texts and subjects.

It is carried out as a project for the Fontys University of ICT and the Central Bureau of Statistics of the Netherlands.

Currently it is focused on detecting sickness.

In [1]:
import pandas as pd
import numpy as np
import operator
import re
from collections import Counter
import gensim
from gensim import corpora, models
from gensim.models.word2vec import Word2Vec

doc = pd.read_csv('sample_dataset.csv', sep=';')

search_terms = [re.sub(r'(src:\w+)', '', word).strip() for word in doc['zoekopdracht'].unique()]
print(search_terms)

doc.head()

['ziek']


Unnamed: 0,zoekopdracht,datum,url,sentiment,type,discussielengte,views,auteur,volgers,invloed,GPS breedtegraad,GPS lengtegraad,bericht tekst,type bron,titel
0,ziek src:twitter,2017-09-20 10:11,https://twitter.com/HubertDeMeulder/status/910...,-,post,,14671.0,HubertDeMeulder,14671,4.8,,,#SabineHagedoren: “Op het werk wist niemand da...,twitter,
1,ziek src:twitter,2017-09-20 10:11,https://twitter.com/michielvdbroeck/status/910...,+,comment,4.0,,michielvdbroeck,1492,1.7,,,@heloisesell Haha wtf ziek nice,twitter,
2,ziek src:twitter,2017-09-20 10:10,https://twitter.com/petervalk1/status/91041589...,-,comment,2.0,1088.0,petervalk1,1088,1.8,,,"RT @AcrisiuS322: Van der Laan is ziek gemeld, ...",twitter,
3,ziek src:twitter,2017-09-20 10:09,https://twitter.com/Korneel_Evers/status/91041...,+,comment,644.0,,Korneel_Evers,4411,21.4,,,@_jazzybelle @zoekpostadres @real_Raffie @olaf...,twitter,
4,ziek src:twitter,2017-09-20 10:07,https://twitter.com/sariehimpe/status/91041518...,-,post,,332.0,sariehimpe,332,0.4,,,Iedereen gaat al een week de tijd gehad hebben...,twitter,
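
The `search_terms` expression in cell 1 strips the `src:<name>` tag from each query and trims the leftover whitespace. A minimal sketch of that step in isolation is shown here; the helper name `strip_source_tag` is hypothetical and does not appear in the notebook.

```python
import re

def strip_source_tag(query: str) -> str:
    """Remove a `src:<name>` tag from a search query and trim whitespace."""
    return re.sub(r'(src:\w+)', '', query).strip()

# A query with a term and a source tag keeps only the term
print(strip_source_tag('ziek src:twitter'))   # ziek

# A query that consists of the source tag alone becomes an empty string
print(repr(strip_source_tag('src:twitter')))  # ''
```

Note that a query without a search term, such as `src:twitter`, yields an empty string rather than being dropped.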


# Cleaning data
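Here we do several things:
* replace missing messages with the empty string `''`
* convert text to lowercase
* remove retweets, tweets that contain links, and tweets from a few excluded accounts
* drop duplicate rows

In the context of our research, a retweet does not add any relevant data. It actually skews the results, since a specific phrasing gets repeated more often than it naturally would.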

In [2]:
doc['bericht tekst'] = doc['bericht tekst'].fillna('')

In [3]:
doc['bericht tekst'] = doc['bericht tekst'].str.lower()

In [4]:
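# Drop retweets. Note: 'rt' is matched as a plain substring, so tweets that
# contain words such as 'hart' or 'sport' are dropped as well.
doc = doc[~doc['bericht tekst'].str.contains('rt')]
# Drop tweets that contain a link
doc = doc[~doc['bericht tekst'].str.contains(r'http[s]*')]  # TODO secure?
# Drop tweets from specific accounts that are excluded from the analysis
doc = doc[~doc['auteur'].str.contains('grieptweets')]  # TODO to file
doc = doc[~doc['auteur'].str.contains('kleenex_helpt')]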
# doc.head()

In [5]:
doc = doc.drop_duplicates()
len(doc)

6530
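
The retweet filter in cell 4 tests for the substring `rt` anywhere in the lowercased text. A stricter check could look like the sketch below. It assumes that a retweet always starts with `rt @user` after lowercasing, which matches the samples shown in this notebook but is not confirmed in general. The names `re_retweet` and `is_retweet` are hypothetical, and the second sample sentence is invented for illustration.

```python
import re

# Assumption: a lowercased retweet starts with "rt @<username>"
re_retweet = re.compile(r'^rt @\w+')

def is_retweet(text: str) -> bool:
    """Return True when the text looks like a retweet."""
    return bool(re_retweet.match(text))

samples = [
    'rt @acrisius322: van der laan is ziek gemeld',  # a real retweet
    'ik heb een zwak hart en ben ziek',              # invented; 'hart' contains 'rt'
    '@heloisesell haha wtf ziek nice',               # a normal reply
]

# Anchored pattern: only the real retweet is removed
kept = [text for text in samples if not is_retweet(text)]

# Plain substring test, as in cell 4: the 'hart' tweet is removed too
substring_kept = [text for text in samples if 'rt' not in text]

print(len(kept), len(substring_kept))  # 2 1
```

On the DataFrame this could be applied as `doc[~doc['bericht tekst'].map(is_retweet)]`. The row count reported above would likely differ with such a filter.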

# Definition of helper function


In [6]:
re_clean = re.compile(r'(https?://\S+|@\S+)')
re_words = re.compile(r'(\w+-?\w*)')

def clean_text(text: str):
    words = []
    if text:
        text = re_clean.sub(' ', text)
        words = re_words.findall(text)
    return words
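
To illustrate what `clean_text()` returns, the sketch below repeats the helper so that it runs on its own and applies it to a few inputs. The first input is taken from the sample data; the second, including the `example.com` URL, is invented for illustration.

```python
import re

re_clean = re.compile(r'(https?://\S+|@\S+)')  # links and @mentions
re_words = re.compile(r'(\w+-?\w*)')           # words, optionally hyphenated

def clean_text(text: str):
    words = []
    if text:
        text = re_clean.sub(' ', text)
        words = re_words.findall(text)
    return words

# Mentions are removed entirely
print(clean_text('@heloisesell haha wtf ziek nice'))
# ['haha', 'wtf', 'ziek', 'nice']

# Links are removed and the '#' of a hashtag is dropped
print(clean_text('ziek thuis #griep https://example.com/x'))
# ['ziek', 'thuis', 'griep']

# An empty message gives an empty list
print(clean_text(''))
# []
```

Because the `#` is never part of a word match, hashtags end up as ordinary words in the result.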

In [7]:
all_tweets = pd.read_csv('sample_alltweets.csv', sep=';')
all_tweets.head()

Unnamed: 0,zoekopdracht,datum,url,sentiment,type,discussielengte,views,auteur,volgers,invloed,GPS breedtegraad,GPS lengtegraad,bericht tekst,type bron,titel
0,src:twitter,2017-09-19 13:09,https://twitter.com/Savage_Grethan/status/9100...,,comment,2.0,844.0,Savage_Grethan,844,0.4,,,RT @thedolanclones: SJKSK IM QUAKIHG IN MY BOOTS,twitter,
1,src:twitter,2017-09-20 10:47,https://twitter.com/PolBegov/status/9104252503...,,post,,2779.0,PolBegov,2779,0.9,,,VUB hanteert quotum voor vrouwelijke professor...,twitter,
2,src:twitter,2017-09-29 13:07,https://twitter.com/robbertnootnoot/status/913...,,post,,300.0,robbertnootnoot,300,6.3,,,Thank the Gods we're going to wave goodbye to ...,twitter,
3,src:twitter,2017-10-02 21:09,https://twitter.com/BanessaFabulous/status/914...,,comment,456.0,,BanessaFabulous,102,0.6,,,@nytimes @rcallimachi It really does not matte...,twitter,
4,src:twitter,2017-09-12 07:13,https://twitter.com/Radio_20/status/9074721078...,,post,,2055.0,Radio_20,2055,1.5,,,#Media companies must reimagine their #Data fo...,twitter,


# Refining data
Split each tweet into words, since individual words are easier to process.

In [8]:
doc['bericht woorden'] = doc['bericht tekst'].map(clean_text)

# Cleaning data
Here we do the same for the dataset with all tweets:
* replace messages that have no content with the empty string `''`
* remove retweets
* convert to lowercase

The replacement is needed since the `clean_text()` function expects a string.

In [9]:
all_tweets['bericht tekst'] = all_tweets['bericht tekst'].fillna('')
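
The sketch below illustrates why the `fillna('')` step matters. pandas represents an empty CSV field as the float `NaN`, which is truthy, so the `if text:` guard in `clean_text()` would not catch it and the regular expression would then fail. Only the standard library is used here; `missing` and `failed` are illustrative names.

```python
import re

re_clean = re.compile(r'(https?://\S+|@\S+)')

missing = float('nan')  # what pandas stores for an empty CSV field

# NaN is truthy, so a plain `if text:` check lets it through
print(bool(missing))  # True

# Passing it to the regular expression raises a TypeError
failed = False
try:
    re_clean.sub(' ', missing)
except TypeError:
    failed = True

print(failed)  # True
```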

In [10]:
all_tweets['bericht tekst'] = all_tweets['bericht tekst'].str.lower()

In [11]:
all_tweets = all_tweets[~all_tweets['bericht tekst'].str.contains('rt')]
all_tweets.head()

Unnamed: 0,zoekopdracht,datum,url,sentiment,type,discussielengte,views,auteur,volgers,invloed,GPS breedtegraad,GPS lengtegraad,bericht tekst,type bron,titel
2,src:twitter,2017-09-29 13:07,https://twitter.com/robbertnootnoot/status/913...,,post,,300.0,robbertnootnoot,300,6.3,,,thank the gods we're going to wave goodbye to ...,twitter,
3,src:twitter,2017-10-02 21:09,https://twitter.com/BanessaFabulous/status/914...,,comment,456.0,,BanessaFabulous,102,0.6,,,@nytimes @rcallimachi it really does not matte...,twitter,
4,src:twitter,2017-09-12 07:13,https://twitter.com/Radio_20/status/9074721078...,,post,,2055.0,Radio_20,2055,1.5,,,#media companies must reimagine their #data fo...,twitter,
7,src:twitter,2017-10-04 15:31,https://twitter.com/Drflsn/status/915569982007...,,comment,,,Drflsn,758,0.3,,,@westerns_sq @skatemlley wellllkammmmmmmm,twitter,
11,src:twitter,2017-09-26 19:18,https://twitter.com/handige_henny/status/91272...,,post,,212.0,handige_henny,212,2.1,,,ik heb een video toegevoegd aan een @youtube-a...,twitter,


In [12]:
all_tweets['bericht woorden'] = all_tweets['bericht tekst'].map(clean_text)

# Learning

Now we start learning from our dataset.

First we find the most common words in our set, so we know which ones to ignore. Think of words like "the" and "it" in English.

We put these words in our blacklist, along with our search terms.

We use the unfiltered dataset (`all_tweets`) so that all available tweets are scanned.


In [13]:
execute_learning = False

In [14]:
result = []
if execute_learning: 
    counter = Counter()
    for words in all_tweets['bericht woorden']:
        counter.update(words)
    result = counter.most_common(25)
result

[]

In [15]:
if execute_learning: 
    common_words = {word for word, _ in counter.most_common(300)}

In [16]:
if execute_learning: 
    blacklisted_words = set(common_words)
    blacklisted_words.update(set(search_terms))
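
Since `execute_learning` is `False`, the cells above produce no output. The sketch below runs the same steps on a toy corpus to show what the blacklist ends up containing. The token lists are invented, and the cut-off is 2 instead of 300 because the corpus is tiny.

```python
from collections import Counter

# Invented token lists standing in for all_tweets['bericht woorden']
toy_tweets = [
    ['ik', 'ben', 'ziek'],
    ['ik', 'ben', 'moe'],
    ['ik', 'heb', 'griep'],
]
toy_search_terms = ['ziek']

# Count every word occurrence across all tweets
counter = Counter()
for words in toy_tweets:
    counter.update(words)

# The most frequent words are treated as uninformative
common_words = {word for word, _ in counter.most_common(2)}

# The search terms are blacklisted too, since every matching tweet contains them
blacklisted_words = common_words | set(toy_search_terms)

print(sorted(blacklisted_words))  # ['ben', 'ik', 'ziek']
```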

# Related words
Now we try to find words related to our search terms.

We use the blacklist to filter out the common words and the search terms themselves.

In essence this produces another most-common-words list, but one computed on the search-term dataset (`doc`), so it shows which words co-occur with the search terms rather than which words are common in general.

These words are not used by the actual model. They do, however, help in defining relevant search words.

In [17]:
related_words = []
if execute_learning: 
    counter = Counter()

    for words in doc['bericht woorden']:
        words = set(words)
        filtered_words = words - blacklisted_words
        counter.update(filtered_words)

    related_words = counter.most_common(25)
related_words

[]
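
The related-words cell also shows no output while `execute_learning` is `False`. The following sketch reproduces its logic on invented token lists. Converting each tweet to a `set` means a word counts at most once per tweet, so the result is the number of tweets a word appears in, not its total number of occurrences.

```python
from collections import Counter

# Invented blacklist and token lists, for illustration only
blacklisted_words = {'ik', 'ben', 'ziek'}
toy_tweets = [
    ['ik', 'ben', 'ziek', 'griep', 'griep'],   # 'griep' appears twice here
    ['ziek', 'thuis', 'griep'],
    ['ik', 'ben', 'ziek', 'koorts', 'griep'],
]

counter = Counter()
for words in toy_tweets:
    # Unique words of this tweet, minus the blacklisted ones
    counter.update(set(words) - blacklisted_words)

related_words = counter.most_common(1)
print(related_words)  # [('griep', 3)]
```

Without the `set()` conversion, `griep` would be counted four times instead of three.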

# Training
Now we train the model using a dataset from Wikipedia.

This dataset represents normal language use and lets the model learn the relations between words.

We have already trained a model and load it here.

In [18]:
model = Word2Vec.load('word2vec.model')

# Analysing
Now we will analyse our dataset using our model.

We compare each tweet against our `sickness_terms` and use the similarity as a score for how likely it is that the tweet comes from a sick person.


In [19]:
sickness_terms = [
    'ziek',
    'griep',
    'verkouden',
    'verkoudheid',
    'koorts',
    'hoofdpijn',
]

In [20]:
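def scorer(row):
    # Score each row once; rows that already carry a score are left untouched
    if 'score' not in row:
        score = 0
        # Keep only words known to the model (`wv.vocab` is the gensim 3.x API;
        # gensim 4 uses `wv.key_to_index` instead)
        words = [word.replace('#', '') for word in row['bericht woorden'] if word in model.wv.vocab]
        if words:
            # Similarity between the sickness terms and the words of the tweet
            score = model.wv.n_similarity(sickness_terms, words)
        # Comments count half as much as original posts
        if row['type'] == 'comment':
            score /= 2
        row['score'] = score
    return row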

doc = doc.apply(scorer, axis=1)
doc = doc.sort_values('score', ascending=False)
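
To make the scoring step more concrete, here is a standard-library sketch of what `scorer()` computes. It assumes that `n_similarity` is the cosine similarity between the mean vectors of the two word lists, which is how gensim's implementation is understood to work. The 2-D vectors and the names `toy_vectors`, `mean_vector`, `cosine` and `toy_score` are invented for illustration; a real model uses far more dimensions.

```python
import math

# Hypothetical 2-D word vectors, invented for this example
toy_vectors = {
    'ziek': (1.0, 0.0),
    'griep': (0.8, 0.6),
    'koorts': (0.6, 0.8),
    'fiets': (0.0, 1.0),
}

def mean_vector(words):
    """Average the vectors of the given words, component by component."""
    vectors = [toy_vectors[word] for word in words]
    return tuple(sum(parts) / len(vectors) for parts in zip(*vectors))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_score(terms, words, tweet_type):
    """Mirror scorer(): ignore unknown words, halve the score for comments."""
    known = [word for word in words if word in toy_vectors]
    if not known:
        return 0.0
    score = cosine(mean_vector(terms), mean_vector(known))
    if tweet_type == 'comment':
        score /= 2
    return score

terms = ['ziek', 'griep']
sick_post = toy_score(terms, ['griep', 'koorts'], 'post')
bike_post = toy_score(terms, ['fiets'], 'post')
sick_comment = toy_score(terms, ['griep', 'koorts'], 'comment')
print(sick_post > bike_post, sick_comment < sick_post)  # True True
```

Because the word vectors are averaged, a very short tweet made up only of sickness-related words can score higher than a longer tweet that mentions the same words among others.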

In [21]:
pd.set_option('display.max_colwidth', 250)
pd.options.display.max_rows = 999

# Visualization
As you can see below, the model is pretty good at finding the relevant tweets.

The problem we face now is that Dutch people like to swear with diseases:
 - Ziek volkje -> sick people (in the sense of not thinking straight)
 - Tering ziek -> tuberculosis sick (ambiguous: the author could be really sick, or could be expressing incomprehension)

In [22]:
doc.filter(items=['bericht tekst',  'score', 'auteur', 'type']).head(250)

Unnamed: 0,bericht tekst,score,auteur,type
3213,tyfus ziek,0.797106,MilouAmann,post
4668,ziek erge hoofdpijn.. iemand tips?,0.791987,mikexyn,post
9679,voor 2e dag ziek thuis. maag/darmen teveel prikkels. heerst er weer iets? #buikgriep #griep #ziek #ziekthuis #zzm #ziekzwakmisselijk #gatver,0.787678,johnkapjr,post
2185,ziek thuis griep en verkouden 😷,0.762325,sabihaaakdeniz,post
13391,"hoofdpijn, buikpijn, moe tekenen van ziek komen??😞",0.761952,FireflyChar90,post
12865,ziek veel stress,0.759097,cedric_vandun,post
4767,"hij is misschien psychisch ziek ""ingebeelde ziekten""",0.758268,merckxchristin2,post
17373,ziek volkje.,0.744404,ronald_brok,post
17507,tering ziek..,0.741515,sel0reos,post
17929,smerig ziek klotevolk,0.73824,swimmingsigs,post
