# Test wiki lexicon
In which we determine how useful the lexicon of wikipedia references (developed by [Spitkovsky and Chang (2011)](https://nlp.stanford.edu/pubs/crosswikis.pdf)) really is.

In [1]:
from bz2 import BZ2File
import json
import pandas as pd

## Load data

In [12]:
dict_file_name = '/hg190/corpora/crosswikis-data.tar.bz2/dictionary.bz2'
cutoff = 100
ctr = 0
with BZ2File(dict_file_name, 'r') as dict_file:
    for l in dict_file:
        l = l.strip()
        print(l.split(' '))
        ctr += 1
        if(ctr >= cutoff):
            break

['0.454197', ',_Saskatchewan', 'R09', 'w:1066/2347']
['0.184065', 'List_of_countries_and_capitals_in_native_languages', 'L', 'W08', 'W09', 'WDB', 'l', 'w:432/2347']
['0.0528334', 'railway_station', 'W09', 'w:124/2347']
['0.0204516', 'Manitoba', 'D', 'KB', 'W08', 'W09', 'WDB', 'w:48/2347']
['0.0178952', 'Plus_and_minus_signs', 'W08', 'W09', 'WDB', 'w:42/2347']
['0.0127823', 'District', 'W08', 'W09', 'WDB', 'w:30/2347']
['0.00894759', 'Department', 'D', 'W08', 'W09', 'WDB', 'w:21/2347']
['0.00681721', 'List_of_Saskatchewan_provincial_highways', 'L', 'W08', 'W09', 'WDB', 'l', 'w:16/2347']
['0.00596506', 'Wicket-keeper', 'W08', 'W09', 'WDB', 'w:14/2347']
['0.00553899', 'Richard_Pankhurst_(academic)', 'W09', 'WDB', 'w:13/2347']
['0.00511291', 'National_Parks_and_Wildlife_Service_(New_South_Wales)', 'D', 'W08', 'W09', 'WDB', 'w:12/2347']
['0.00426076', 'Doubleday_(publisher)', 'KB', 'W08', 'W09', 'WDB', 'w:10/2347']
['0.00426076', 'FC', 'D', 'W08', 'W09', 'WDB', 'w:10/2347']
['0.00383468', '

Looks like that file contains the actual URLs. We need the string mentions!

According to [this](https://nlp.stanford.edu/data/crosswikis-data.tar.bz2/READ_ME.txt), the mention string might be blank. So if the first chunk of the string is not a probability then it's probably a mention string.

In [24]:
import re
ctr = 0
cutoff = 1000
# if the first string is not just a probability then it's a string mention
with BZ2File(dict_file_name, 'r') as dict_file:
    for l in dict_file:
        l = l.strip()
        if(len(l.split('\t')) > 1):
            ctr += 1
            print(l)
        if(ctr >= cutoff):
            break

_og_ eir	0 Þú_og_þeir_(Sókrates) KB W09 W08 WDB c
_og_ eir_(s krates)	0 Þú_og_þeir_(Sókrates) KB W09 W08 WDB t
_veistu_svari 	0 Þá_veistu_svarið KB W08 W09 WDB c t
elstan_i	0 Æthelstan_of_Wessex W08 W09 WDB c t
ey	0 EY D W08 W09 WDB c t
igo_arista	0 Íñigo_Arista_of_Pamplona W08 W09 WDB c t
igo_cervantes-hueg n	0 Iñigo_Cervantes-Huegun c t
igo_cuesta	0 Íñigo_Cuesta W08 W09 WDB c t
igo_fern ndez	0 Íñigo_Fernández NR RWB W08 W09 WDB c t
igo_fern ndez_de_velasco,_2nd_duke_of_fr as	0 Íñigo_Fernández NR RWB W08 W09 WDB c t
igo_l pez	0 Íñigo_López c t
igo_l pez_de_mendoza	0 Íñigo_López_de_Mendoza,_marqués_de_Santillana NR RWB W08 W09 WDB c t
igo_l pez_de_mendoza_y_z  iga	0 Íñigo_López_de_Mendoza_y_Zúñiga W08 W09 WDB c t
igo_m ndez_de_vigo	0 Íñigo_Méndez_de_Vigo W08 W09 WDB c t
igo_melchor_de_velasco,_7th_duke_of_fr as	0 Íñigo_Melchor_de_Velasco,_7th_Duke_of_Frias NR RWB W08 W09 WDB c t
igo_v lez_de_guevara,_8th_count_of_o ate	0 Íñigo_Vélez_de_Guevara,_8th_Count_of_Oñate W08 W09 WDB c t
igo_v 

This looks good! Let's collect all the possible entity mentions.

In [3]:
from collections import Counter
string_mentions = Counter()
# 30 minutes to load ;_;
dict_file_name = '/hg190/corpora/crosswikis-data.tar.bz2/dictionary.bz2'
with BZ2File(dict_file_name, 'r') as dict_file:
    line_generator = (l for l in dict_file if len(l.split('\t')) > 0)
    for i, l in enumerate(line_generator):
        string_mentions[l.split('\t')[0]] += 1
        if(i % 1000000 == 0):
            print('processed %d dict lines'%(i))

processed 0 dict lines
processed 1000000 dict lines
processed 2000000 dict lines
processed 3000000 dict lines
processed 4000000 dict lines
processed 5000000 dict lines
processed 6000000 dict lines
processed 7000000 dict lines
processed 8000000 dict lines
processed 9000000 dict lines
processed 10000000 dict lines
processed 11000000 dict lines
processed 12000000 dict lines
processed 13000000 dict lines
processed 14000000 dict lines
processed 15000000 dict lines
processed 16000000 dict lines
processed 17000000 dict lines
processed 18000000 dict lines
processed 19000000 dict lines
processed 20000000 dict lines
processed 21000000 dict lines
processed 22000000 dict lines
processed 23000000 dict lines
processed 24000000 dict lines
processed 25000000 dict lines
processed 26000000 dict lines
processed 27000000 dict lines
processed 28000000 dict lines
processed 29000000 dict lines
processed 30000000 dict lines
processed 31000000 dict lines
processed 32000000 dict lines
processed 33000000 dict li

In [27]:
print(len(string_mentions))

175335446


In [32]:
top_k = 100
print('\n'.join(map(lambda x: '='.join([x[0], str(x[1])]), string_mentions.most_common(top_k))))

View original Wikipedia Article=3436372
Wikipedia=3397331
Wikipedia article=3092369
View on Wikipedia=3016510
original Wikipedia article=2913987
View article on Wikipedia=2832128
here=2502891
wikipedia=2126682
View article on Wikipedia »=2087905
Source: Wikipedia=1824735
en.wikipedia.org=1524449
http://en, uh-hah-hah-hah.wikipedia.org=1484944
View this article at Wikipedia=1449314
Desktop View=1292791
English=1232446
Original document=1197531
Read=1016083
More from Wikipedia.org »=1013554
view article=805097
위키피디아에서 보기=783987
click here=678243
ウィキペディアを見る=608600
Français=604819
Images from Wikipedia=597798
Deutsch=580641
Article on Wikipedia=553388
Version Imprimable=494283
Read Article on Wikipedia=479402
More @Wikipedia=467159
Wikipedia.org=457026
Wikipedię=456533
Read article at Wikipedia=449392
wikipedia.de=438921
contenido=438163
Italiano=436802
Quelle: Wikipedia=436036
Español=434783
source=425356
Original Page=418075
Source=413224
Wikipedia Article=412082
Nederlands=410842
cette 

How many of these mentions are location-specific?

In [9]:
# write to file first lol
import pandas as pd
from bz2 import BZ2File
from itertools import izip
string_mentions_series = pd.Series(string_mentions).sort_values(inplace=False, ascending=False)
string_mention_file_name = '/hg190/corpora/crosswikis-data.tar.bz2/string_mention_counts.bz2'
with BZ2File(string_mention_file_name, 'w') as string_mention_file:
    for i_str, i_count in izip(string_mentions_series.index, string_mentions_series):
        string_mention_file.write('%s,%d\n'%(i_str, i_count))

In [13]:
print(len(string_mentions_series))

175335446


In [10]:
import pandas as pd
import re
string_mentions_series = pd.Series(string_mentions)
string_mentions_idx = string_mentions_series.index.tolist()
location_words = ['street', 'road', 'house', 'building', 'neighborhood', 'town', 'city', 'county']
location_matcher = re.compile('|'.join(location_words))
string_mentions_location = filter(lambda x: len(location_matcher.findall(x.lower())) > 0, string_mentions_idx)
string_mentions_location = string_mentions_series.loc[string_mentions_location].sort_values(inplace=False, ascending=False)
print(string_mentions_location.head(100))

City Info                                          5026
County Page at Wikipedia                           2200
city overview                                      2138
city                                               1761
town                                               1708
rcity overview                                     1675
Cities and towns                                   1629
About This City                                     833
City                                                797
Town Info                                           770
House                                               725
Twin towns                                          675
house                                               650
New York City                                       622
Cities/Towns/Townships                              574
Buildings                                           574
county                                              560
4 Cities and towns                              

## Build lexicon extractor
Let's build a giant regex to extract potential entities from text, because that's what previous work seems to rely on.

In [17]:
print(string_mentions_series.index[-10:])

Index([u'󾮗デューク', u'􀂃 Consejo Escolar', u'�re', u'�verfurir', u'�verste',
       u'�re', u'�', u'�r�n', u'�stra', u'�orramatur'],
      dtype='object')


In [24]:
print('blah'.encode('utf-8'))

blah


In [74]:
# filter functions
def test_unicode(x):
    try:
        x.encode('utf-8')
        return True
    except Exception, e:
        return False
MAX_LENGTH=5
def test_length(x):
    x_tokens = x.split(' ')
    return len(x_tokens) <= MAX_LENGTH
noise_chars = ['\|', '\(', '\)', '\+', '\*', '?', '\[', '\]', '\^', '\$']
noise_finder = re.compile('[%s]'%(noise_chars))
def clean_mention(x):
    x_clean = x.strip()
    x_noise = noise_finder.findall(x_clean)
    for n in x_noise:
        x_clean.replace(n, '\%s'%(n))
    return x_clean
def test_regex(x):
    try:
        re.compile(x)
        return True
    except Exception, e:
        return False

In [76]:
# string_mentions = string_mentions_series.index.tolist()
print('start with %d string mentions'%(len(string_mentions)))
string_mentions_clean = map(clean_mention, string_mentions)
# unicode
string_mentions_filtered = filter(test_unicode, string_mentions_clean)
print('now have %d filtered string mentions after Unicode'%(len(string_mentions_filtered)))
# length
string_mentions_filtered = filter(test_length, string_mentions_filtered)
print('now have %d filtered string mentions after length'%(len(string_mentions_filtered)))
# regex
string_mentions_filtered = filter(test_regex, string_mentions_filtered)

start with 175335446 string mentions
now have 143451251 filtered string mentions after Unicode
now have 122270880 filtered string mentions after length


KeyboardInterrupt: 

In [77]:
print('ending up with %d filtered mentions'%(len(string_mentions_filtered)))

ending up with 122270880 filtered mentions


Let's get a wide sample of these mentions.

In [78]:
chunks = 10
chunk_size = int(len(string_mentions_filtered) / chunks)
top_k = 20
for i in range(chunks):
    print(string_mentions_filtered[i*chunk_size:i*chunk_size+top_k])

['', '', '_og_ eir', '_og_ eir_(s krates)', '_veistu_svari', 'elstan_i', 'ey', 'igo_arista', 'igo_cervantes-hueg n', 'igo_cuesta', 'igo_fern ndez', 'igo_fern ndez_de_velasco,_2nd_duke_of_fr as', 'igo_l pez', 'igo_l pez_de_mendoza', 'igo_l pez_de_mendoza_y_z  iga', 'igo_m ndez_de_vigo', 'igo_melchor_de_velasco,_7th_duke_of_fr as', 'igo_v lez_de_guevara,_8th_count_of_o ate', 'igo_v lez_de_guevara,_count_of_o ate', 'inn_j nsson']
['Annie Awards', 'Annie Awards (Wikipedia)', 'Annie Awards - Wikipedia', 'Annie Awards 1999', 'Annie Awards 2011', 'Annie Awards Wikipedia page.', 'Annie Awards,', 'Annie Awards:', 'Annie B Sweet', 'Annie B.', 'Annie B. Crockett-Stark', 'Annie B. Sweet', 'Annie Baker', 'Annie Balestra', 'Annie Banannie Promo', 'Annie Bandez', 'Annie Baobei', 'Annie Barabe', 'Annie Barker', 'Annie Barnes']
['Etymology and sentiment', 'Etymology and spelling', 'Etymology and spelling variations', 'Etymology and spellings', 'Etymology and spread', 'Etymology and state name', 'Etymol

Are all these strings regex-friendly?

In [None]:
fail_ctr = 0
fail_cutoff = 50
for s in string_mentions_filtered:
    try:
        re.compile(s)
    except Exception, e:
        print('%s failed to compile'%(s))
        fail_ctr += 1
        if(fail_ctr >= fail_cutoff):
            break

In [65]:
lexicon_matcher = re.compile('|'.join(string_mentions_series.index.tolist()))

error: unbalanced parenthesis

Looks like there is some mention string that still makes the regex fail.

Let's try to recreate this failure iteratively.

In [None]:
full_regex = ''
for i, s in enumerate(string_mentions_series.index):
    if(i == 0):
        full_regex = s
    else:
        full_regex += '|%s'%(s)
    try:
        lexicon_matcher = re.compile(full_regex)
    except Exception, e:
        print('regex fails at string mention %s'%(s))
        break

In [68]:
print('processed %d mentions'%(i))

processed 2003 mentions


Update: this ran for an hour over 2000 mentions without any noticeable failure.

I don't think we'll be able to use a regex.

Let's take a subset of this lexicon and use it for text matching.

## Test entity extraction from text
Can we just apply the lexicon to raw Twitter data and extract entities? This seems way too optimistic.

In [69]:
import gzip
test_data_file = '../../data/mined_tweets/HurricaneHarvey_ids_rehydrated_clean.txt.gz'
harvey_txt = [l.strip() for l in gzip.open(test_data_file, 'r')]
print('\n'.join(harvey_txt[:10]))

Somebody tell POTUS historic #HoustonFloods larger than him &amp; FEMA 🙏🏽
@ChrisGroove1 @Bea_Bells We'll keep our paws crossed that that's what #Harvey  does! #nipclub
Why does @realDonaldTrump make #HurricaneHarvey sound like the most successful hurricane in the history of hurricanes???
"The EPA and TCEQ said Saturday that 161 drinking water systems have boil-water notices, and another 52 are shut do…
#EarlyMorningEveningNews w: @DaRealMonieLove: Southeast Texas Slammed By #HurricaneHarvey; Over 5K Evacuees Head To…
The Brookwoods Group team is not in the office. DM if you need to reach me. Or any of us, for that matter. #Houston #Harvey @brookwoodsgroup
Its all about #hurricaneharvey #katrina #texasfloods on this ep on #souncloud #blackmedia #podsincolor #podernfamily
“We are just returning to our houses and trying to figure out what to do next." #HurricaneHarvey #Harvey2017
crimethinc: Our hearts go out to those in the path of #Harvey and all impacted by climate change. We count on 

Just use a sample of the tweets to stay sane.

In [103]:
sample_size = 10000
tweet_sample = pd.np.random.choice(harvey_txt, size=sample_size, replace=False)

In [79]:
lexicon_set = set(string_mentions_filtered)

Extract all 1,2,3-grams from data and compare overlap with lexicon.

In [104]:
print(tweet_sample[0])

#airbnb is offering help for #Harvey evacuees


In [105]:
from nltk.tokenize.casual import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer
ngram_range = (1,3)
tokenizer = TweetTokenizer()
cv = CountVectorizer(min_df=1, ngram_range=ngram_range, tokenizer=tokenizer.tokenize)
tweet_dtm = cv.fit_transform(tweet_sample)
ivoc_lookup = {v : k for k,v in cv.vocabulary_.iteritems()}
top_k = 10
top_k_ngrams = [ivoc_lookup[x] for x in pd.np.array(tweet_dtm.sum(axis=0).argsort())[0][-top_k:]]
print(','.join(top_k_ngrams))

!,…,in,of,,,#hurricaneharvey,the,#harvey,to,.


In [106]:
# get overlap
tweet_lexicon_overlap = set(cv.vocabulary_.keys()) & set(lexicon_set)
print('got %d lexicon overlap'%(len(tweet_lexicon_overlap)))

got 24700 lexicon overlap


In [107]:
top_k = 10
print(','.join(sorted(tweet_lexicon_overlap)[:top_k]))

!,! &,! link,"," "," and "," by," here,#,# 1


This isn't great! How can these strings be part of the lexicon?

In [108]:
tweet_lexicon_overlap_sorted = sorted(tweet_lexicon_overlap, key=lambda x: len(x), reverse=True)
print(tweet_lexicon_overlap_sorted[:top_k])

[u'american watchmakers-clockmakers institute', u'conservatives and libertarians', u'federal disaster declaration', u'jefferson county courthouse', u'mental health professionals', u'strategic petroleum reserve', u'emergency operations center', u'emergency management agency', u'international space station', u'major disaster declaration']


These look less bad!

Let's see which were the most common.

In [116]:
from itertools import izip
tweet_ngram_totals = pd.np.array(tweet_dtm.sum(axis=0))[0]
tweet_lexicon_counts = [tweet_ngram_totals[cv.vocabulary_[x]] for x in tweet_lexicon_overlap]
tweet_lexicon_sorted = sorted(izip(tweet_lexicon_overlap, tweet_lexicon_counts), key=lambda x: x[1], reverse=True)

In [119]:
print(tweet_lexicon_sorted[:10])
print(tweet_lexicon_sorted[100:110])
print(tweet_lexicon_sorted[500:510])

[(u'.', 6550), (u'to', 4289), (u'the', 3949), (u',', 3174), (u'of', 2513), (u'in', 2511), (u'!', 2303), (u'for', 2104), (u'a', 2068), (u'and', 1961)]
[(u'there', 210), (u'this is', 208), (u'good', 207), (u'he', 206), (u'been', 206), (u'their', 206), (u'an', 205), (u'everyone', 204), (u'for the', 203), (u'some', 195)]
[(u'continues', 44), (u'resources', 43), (u'americans', 43), (u'lot of', 43), (u'safety', 43), (u'oil', 43), (u'instead', 43), (u'we can', 43), (u'a lot of', 43), (u'info', 43)]


Seems like the entity strings exist mainly on the long tail of mentions.

TODO: kill stop words!!