<a href="https://colab.research.google.com/github/GrzegorzMeller/EventsDetection/blob/master/Wiki_Historical_Event_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We predict whether a given text (the abstract of a Wikipedia page) is about a historical event or not.

In [1]:
from google.colab import drive
drive.mount('/amd/')

Mounted at /amd/


In [2]:
!cp /amd/My\ Drive/wiki.csv /content/
!cp /amd/My\ Drive/test_wiki.csv /content/

In [3]:
from IPython.core.debugger import set_trace


import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import time

# Note: matplotlib 3.6+ renamed this style to "seaborn-v0_8".
plt.style.use(style="seaborn")


In [9]:
wiki = pd.read_csv("/content/wiki.csv")
test_wiki = pd.read_csv("/content/test_wiki.csv")

In [10]:
wiki.head()

Unnamed: 0.1,Unnamed: 0,Name,Abstract,Target
0,0,http://dbpedia.org/resource/1098,This article is about the year 1098. Year 1098...,1
1,1,http://dbpedia.org/resource/1948_Arab–Israeli_War,Arab–Israeli War redirects here. For other use...,1
2,2,http://dbpedia.org/resource/Battle_of_Britain,The Battle of Britain German: die Luftschlacht...,1
3,3,http://dbpedia.org/resource/Battle_of_Evesham,The Battle of Evesham 4 August 1265 was one of...,1
4,4,http://dbpedia.org/resource/Battle_of_Kursk,The Battle of Kursk was a Second World War eng...,1


Remove URLs, HTML tags, and punctuation

In [6]:
import re
import string

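# Compile each pattern once instead of on every call.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
HTML_PATTERN = re.compile(r"<.*?>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_URL(text):
    # Drop http(s):// and www. links.
    return URL_PATTERN.sub("", text)


def remove_html(text):
    # Drop HTML tags such as <b> or </a>.
    return HTML_PATTERN.sub("", text)


def remove_punct(text):
    # Drop ASCII punctuation; non-ASCII marks such as "–" are kept.
    return text.translate(PUNCT_TABLE)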

In [11]:
wiki["Abstract"] = wiki.Abstract.map(remove_URL)
wiki["Abstract"] = wiki.Abstract.map(remove_html)
wiki["Abstract"] = wiki.Abstract.map(remove_punct)

test_wiki["Abstract"] = test_wiki.Abstract.map(remove_URL)
test_wiki["Abstract"] = test_wiki.Abstract.map(remove_html)
test_wiki["Abstract"] = test_wiki.Abstract.map(remove_punct)
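
As a rough illustration of what the three cleaning steps do when chained, the sketch below applies them to one made-up sentence (the sample text and the `clean` helper are hypothetical, not part of the dataset or the notebook). It assumes the same regular expressions and the same order as above.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<.*?>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean(text):
    """Apply the three cleaning steps in the same order as the notebook."""
    text = URL_RE.sub("", text)          # URLs first, while "://" is still intact
    text = HTML_RE.sub("", text)         # then HTML tags
    return text.translate(PUNCT_TABLE)   # finally ASCII punctuation


# Hypothetical sample sentence, not taken from the dataset.
sample = (
    "The <b>Second Barons' War</b> (1264–1267), a well-known conflict; "
    "see https://example.org/war for more!"
)
cleaned = clean(sample)
print(cleaned.split())
```

Two side effects are worth knowing: hyphenated words are fused (`well-known` becomes `wellknown`), and the en dash in `1264–1267` survives because `string.punctuation` only covers ASCII characters.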

Remove stopwords

In [12]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop = set(stopwords.words("english"))


def remove_stopwords(text):
    text = [word.lower() for word in text.split() if word.lower() not in stop]

    return " ".join(text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [13]:
wiki["Abstract"] = wiki["Abstract"].map(remove_stopwords)
test_wiki["Abstract"] = test_wiki["Abstract"].map(remove_stopwords)
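
To see the effect of the stopword filter on a single sentence, here is a minimal sketch. It assumes a tiny hand-written stop list (`STOP` is hypothetical) in place of NLTK's full English list, so it runs without any download.

```python
# Hypothetical mini stop list; the notebook uses NLTK's full English list.
STOP = {"the", "of", "was", "a", "in", "and"}


def remove_stopwords(text, stop=STOP):
    # Lowercase each token once, then keep it only if it is not a stop word.
    tokens = (word.lower() for word in text.split())
    return " ".join(word for word in tokens if word not in stop)


result = remove_stopwords("The Battle of Kursk was a Second World War engagement")
print(result)
```

Note that the function also lowercases every remaining token, which is why the abstracts printed afterwards are entirely lowercase.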

In [14]:
wiki.Abstract

0        article year 1098 year 1098 mxcviii common yea...
1        arab–israeli war redirects uses see arab–israe...
2        battle britain german die luftschlacht um engl...
3        battle evesham 4 august 1265 one two main batt...
4        battle kursk second world war engagement germa...
                               ...                        
22956    openbsd journal online newspaper dedicated cov...
22957    pagseguro online mobile paymentbased ecommerce...
22958    popspoken online culture news publication foun...
22959    radiobetacom internet radio portal allows user...
22960    rapidbuyr businesstobusiness dealoftheday ecom...
Name: Abstract, Length: 22961, dtype: object

### Basic NLP

In [15]:
from collections import Counter

# Count how often each word occurs across all abstracts
def counter_word(text):
    count = Counter()
    for i in text.values:
        for word in i.split():
            count[word] += 1
    return count

In [16]:
text = wiki.Abstract

counter = counter_word(text)

In [17]:
len(counter)

128803
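
With a vocabulary this large, printing the whole counter is hard to read. A possible alternative, sketched here on three made-up abstracts (the `abstracts` list is hypothetical), is to build the counter with a generator expression and look only at the most frequent entries via `Counter.most_common`.

```python
from collections import Counter

# Toy abstracts (hypothetical), already cleaned and lowercased.
abstracts = [
    "battle kursk second world war engagement",
    "battle britain second world war air campaign",
    "second barons war civil war england",
]

# Same result as counter_word: one count per whitespace-separated token.
counter = Counter(word for abstract in abstracts for word in abstract.split())

vocab_size = len(counter)          # number of distinct tokens
top_two = counter.most_common(2)   # highest-frequency tokens first
print(vocab_size, top_two)
```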

In [18]:
counter

Counter({'article': 182,
         'year': 1322,
         '1098': 12,
         'mxcviii': 1,
         'common': 2552,
         'starting': 188,
         'friday': 21,
         'link': 131,
         'display': 250,
         'full': 337,
         'calendar': 108,
         'julian': 60,
         'arab–israeli': 32,
         'war': 13867,
         'redirects': 39,
         'uses': 241,
         'see': 633,
         'disambiguation': 57,
         '1948': 158,
         'first': 5566,
         'fought': 4501,
         'state': 1355,
         'israel': 314,
         'military': 2913,
         'coalition': 647,
         'arab': 336,
         'states': 3279,
         'forming': 128,
         'second': 2422,
         'stage': 366,
         'palestine': 139,
         'tension': 64,
         'conflict': 2236,
         'arabs': 117,
         'jews': 91,
         'british': 4839,
         'forces': 7011,
         'ever': 235,
         'since': 1516,
         '1917': 162,
         'balfour': 4,
       

In [19]:
num_words = len(counter)

# Max number of words in a sequence
max_length = 150
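
The notebook does not say how the value 150 was chosen. One way to sanity-check such a cutoff, sketched below under the assumption that token counts per abstract are available, is to measure what share of texts fits without truncation. The `lengths` values and the `coverage` helper are hypothetical.

```python
# Toy token counts per abstract (hypothetical values).
lengths = [12, 25, 40, 63, 88, 110, 140, 150, 190, 420]


def coverage(lengths, max_length):
    """Fraction of texts that fit into max_length tokens without truncation."""
    return sum(n <= max_length for n in lengths) / len(lengths)


share_at_150 = coverage(lengths, 150)
print(share_at_150)
```

On the real data, the token counts could be obtained with `wiki.Abstract.str.split().str.len()`.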

Shuffle dataset

In [20]:
from sklearn.utils import shuffle
wiki = shuffle(wiki)

In [21]:
wiki.head()

Unnamed: 0.1,Unnamed: 0,Name,Abstract,Target
8421,8421,http://dbpedia.org/resource/Second_Barons'_War,second barons war 1264–1267 civil war england ...,1
7725,7725,http://dbpedia.org/resource/Banbury_mutiny,banbury mutiny mutiny soldiers english new mod...,1
20774,9441,http://dbpedia.org/resource/Birdman_(film),birdman unexpected virtue ignorance commonly k...,0
19066,7733,http://dbpedia.org/resource/Ehretia_acuminata,ehretia acuminata deciduous tree found japan c...,0
2688,2688,http://dbpedia.org/resource/Battle_of_Brier_Creek,battle brier creek american revolutionary war ...,1


Train / validation split (80/20), plus a separate test set

In [22]:
train_size = int(wiki.shape[0] * 0.8)

train_sentences = wiki.Abstract[:train_size]
train_labels = wiki.Target[:train_size]

test_sentences = wiki.Abstract[train_size:]
test_labels = wiki.Target[train_size:]

test2_sentences = test_wiki.Abstract
test2_labels = test_wiki.Target
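
Because the shuffle is unseeded, the 80/20 split differs on every run. The pure-Python sketch below illustrates the idea of a seeded shuffle followed by a cut at 80%; `shuffled_split` and the seed value are hypothetical and stand in for the pandas objects used in the notebook.

```python
import random


def shuffled_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut it at train_fraction."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # same seed -> same order on every run
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


rows = list(range(10))
train_a, val_a = shuffled_split(rows)
train_b, val_b = shuffled_split(rows)
print(train_a == train_b, len(train_a), len(val_a))
```

In the notebook itself, passing `random_state` to `sklearn.utils.shuffle` should give the same kind of reproducibility. With 22961 rows, `int(22961 * 0.8)` yields 18368 training rows and leaves 4593 for validation.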

In [23]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(train_sentences)

Using TensorFlow backend.


In [24]:
word_index = tokenizer.word_index

In [25]:
word_index

{'battle': 1,
 'war': 2,
 'species': 3,
 'also': 4,
 'army': 5,
 'forces': 6,
 'known': 7,
 'first': 8,
 'one': 9,
 'two': 10,
 'family': 11,
 'n': 12,
 'french': 13,
 'british': 14,
 'fought': 15,
 'found': 16,
 'north': 17,
 'south': 18,
 'new': 19,
 'film': 20,
 'genus': 21,
 'small': 22,
 'american': 23,
 'united': 24,
 'siege': 25,
 'took': 26,
 'may': 27,
 'force': 28,
 'name': 29,
 'led': 30,
 'states': 31,
 'part': 32,
 'place': 33,
 'large': 34,
 'world': 35,
 'many': 36,
 'troops': 37,
 'operation': 38,
 'military': 39,
 'years': 40,
 'city': 41,
 'northern': 42,
 'campaign': 43,
 'common': 44,
 'empire': 45,
 'general': 46,
 'three': 47,
 'near': 48,
 'long': 49,
 'second': 50,
 'river': 51,
 'southern': 52,
 'german': 53,
 'including': 54,
 'time': 55,
 'victory': 56,
 'several': 57,
 'called': 58,
 'wars': 59,
 'later': 60,
 'conflict': 61,
 'de': 62,
 'however': 63,
 'major': 64,
 'black': 65,
 'america': 66,
 'western': 67,
 'used': 68,
 'government': 69,
 'although': 70
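
The listing shows that lower indices belong to more frequent words. The following is a simplified pure-Python emulation of that ranking, not the Keras implementation: `fit_word_index` and `to_sequences` are hypothetical helpers, and the real `Tokenizer` additionally lowercases and filters characters.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 stays free for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index):
    """Map each text to a list of indices, dropping words not seen during fitting."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]


train = ["battle war army", "war battle", "war species"]
word_index = fit_word_index(train)
sequences = to_sequences(["battle war unknown"], word_index)
print(word_index, sequences)
```

Because the tokenizer is fitted on the training sentences only, words that appear solely in the validation or test abstracts are dropped in the same way as `unknown` here.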

In [26]:
train_sequences = tokenizer.texts_to_sequences(train_sentences)

In [27]:
train_sequences[0]

[50,
 4156,
 2,
 57620,
 93,
 2,
 331,
 6,
 152,
 4156,
 30,
 3362,
 62,
 6235,
 1437,
 6,
 30,
 477,
 743,
 60,
 743,
 331,
 29,
 388,
 398]

In [28]:
from keras.preprocessing.sequence import pad_sequences

train_padded = pad_sequences(
    train_sequences, maxlen=max_length, padding="post", truncating="post"
)
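
To make the effect of `padding="post"` and `truncating="post"` concrete, here is a small pure-Python sketch of the same behaviour for a single sequence. `pad_post` is a hypothetical helper, not a Keras function.

```python
def pad_post(sequence, maxlen, value=0):
    """Cut the sequence at the end to maxlen, then fill the end with `value`."""
    trimmed = list(sequence[:maxlen])
    return trimmed + [value] * (maxlen - len(trimmed))


short = pad_post([50, 4156, 2], maxlen=5)           # shorter than maxlen: zeros appended
long = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)    # longer than maxlen: tail removed
print(short, long)
```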

In [29]:
train_padded[0]

array([   50,  4156,     2, 57620,    93,     2,   331,     6,   152,
        4156,    30,  3362,    62,  6235,  1437,     6,    30,   477,
         743,    60,   743,   331,    29,   388,   398,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,

In [30]:
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(
    test_sequences, maxlen=max_length, padding="post", truncating="post"
)

test2_sequences = tokenizer.texts_to_sequences(test2_sentences)
test2_padded = pad_sequences(
    test2_sequences, maxlen=max_length, padding="post", truncating="post"
)

In [31]:
print(wiki.iloc[0].Abstract)
print(train_sequences[0])

second barons war 1264–1267 civil war england forces number barons led simon de montfort royalist forces led prince edward later edward england name henry iii
[50, 4156, 2, 57620, 93, 2, 331, 6, 152, 4156, 30, 3362, 62, 6235, 1437, 6, 30, 477, 743, 60, 743, 331, 29, 388, 398]


Check inverse

In [32]:
reverse_word_index = {index: word for word, index in word_index.items()}

In [33]:
def decode(text):
    return " ".join([reverse_word_index.get(i, "?") for i in text])

In [34]:
decode(train_sequences[0])

'second barons war 1264–1267 civil war england forces number barons led simon de montfort royalist forces led prince edward later edward england name henry iii'

In [35]:
print(f"Shape of train {train_padded.shape}")
print(f"Shape of validation {test_padded.shape}")
print(f"Shape of test2 {test2_padded.shape}")

Shape of train (18368, 150)
Shape of validation (4593, 150)
Shape of test2 (1093, 150)


In [36]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.optimizers import Adam
from keras import metrics

model = Sequential()

# Learn a 50-dimensional vector per vocabulary index.
model.add(Embedding(num_words, 50, input_length=max_length))
# Summarize each padded sequence into a 64-dimensional state.
model.add(LSTM(64, dropout=0.1))
# Single sigmoid unit: probability that the abstract describes a historical event.
model.add(Dense(1, activation="sigmoid"))

optimizer = Adam(learning_rate=3e-4)

model.compile(
    loss="binary_crossentropy",
    optimizer=optimizer,
    metrics=["accuracy", metrics.Precision(), metrics.Recall()],
)

In [37]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 150, 50)           6440150   
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                29440     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 6,469,655
Trainable params: 6,469,655
Non-trainable params: 0
_________________________________________________________________
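
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer with bias, and a dense layer; the variable names are illustrative.

```python
vocab_size, embed_dim, lstm_units = 128803, 50, 64

# One embed_dim-sized vector per vocabulary index.
embedding_params = vocab_size * embed_dim
# 4 gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
# One weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the model size.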


In [38]:
history = model.fit(
    train_padded, train_labels, epochs=3, validation_data=(test_padded, test_labels),
)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 18368 samples, validate on 4593 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [None]:
model.save('mdl')

In [None]:
!zip -r mdl.zip mdl

  adding: mdl (deflated 17%)


In [None]:
!cp -r mdl.zip /amd/My\ Drive

Test on different data to check whether the model generalizes well

In [39]:
model.evaluate(test2_padded, test2_labels)



[0.625754629039399, 0.8435498476028442, 0.9115789532661438, 0.7704626321792603]
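
`model.evaluate` returns the loss followed by the metrics in the order given to `compile`, so the four numbers are loss, accuracy, precision, and recall. As a small worked step, the sketch below combines precision and recall into an F1 score; F1 is not computed in the notebook and is added here only as an illustration.

```python
# Values copied from the evaluate() output, in compile order.
loss, accuracy, precision, recall = (
    0.625754629039399,
    0.8435498476028442,
    0.9115789532661438,
    0.7704626321792603,
)

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))
```

Precision is noticeably higher than recall, which suggests that on this second test set the model misses some historical events more often than it raises false alarms.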