# 1. Readings

## 1.1. Text summarizer

Based on https://towardsdatascience.com/summarizing-tweets-in-a-disaster-part-ii-67db021d378d:
- look for situational words that describe the situation or casualties, using spaCy: numerals (e.g. number of casualties, important phone numbers) and entities (e.g. places, dates, events, organisations)
    - use entity types, look for content words
- tf-idf score (rank something like "Nepal" highly, but not "the") --> use Textacy
- clean data before tokenizing: abbreviations, misspellings (NLTK has a Twitter-specific tokenizer)
- formulate the choice of summary words as an integer linear programming (ILP) problem
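
The tf-idf idea from the list above can be sketched without any library. This is a minimal illustration, not the Textacy implementation: the three example tweets and the function name `tf_idf` are made up for the sketch, and it uses the plain variant idf = log(N / df) with no smoothing.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    docs is a list of token lists. Term frequency is normalised by document
    length and idf = log(N / df), so a term that occurs in every document
    scores 0.
    """
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy tweets, lower-cased and whitespace-tokenised.
tweets = [
    "the earthquake hit nepal".split(),
    "the flood closed the road".split(),
    "the fire is near the town".split(),
]
scores = tf_idf(tweets)
print(sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True))
```

Because "the" appears in all three toy tweets its score is zero, while "nepal" appears in only one and therefore ranks above it.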

See also the notebooks:
- for spaCy: https://github.com/gabrieltseng/datascience-projects/blob/master/natural_language_processing/twitter_disasters/spaCy/3%20-%20Abstractive%20Summary.ipynb
- for NLTK: https://github.com/gabrieltseng/datascience-projects/blob/master/natural_language_processing/twitter_disasters/NLTK/3%20-%20Abstractive%20Summary.ipynb

IBM Watson research paper
- https://arxiv.org/pdf/1602.06023.pdf

TensorFlow text summarization model
- https://github.com/tensorflow/models/tree/master/research/textsum

API services
- https://smmry.com/api

Facebook AI research: A Neural Attention Model for Abstractive Sentence Summarization
- https://arxiv.org/pdf/1509.00685.pdf

- idea for the overall approach: use occurring tweets as well (e.g. a Twitter dataset for a wildfire)

## 1.2. Keyword Extraction

- based on https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34
- also very interesting points on text pre-processing in here

<img src = "KeyWordExtraction_HighLevel.png" width="500">

## 1.3. Further NLP Tools

- word embeddings: https://www.wikiwand.com/en/Word_embedding --> check word2vec
- sentiment analysis: https://www.wikiwand.com/en/Sentiment_analysis
    - for background on singular value decomposition https://www.wikiwand.com/en/Singular_value_decomposition
- part-of-speech (POS) tagging
- using word graphs (powerful when there are multiple sentences describing similar situations)
- linguistic quality: compare my sample sentence to "normal" English sentences
    - see also KenLM tool at https://kheafield.com/code/kenlm/
    - and more readings to understand this challenge http://masatohagiwara.net/training-an-n-gram-language-model-and-estimating-sentence-probability.html
    - can be compared to current "correct" American English https://www.english-corpora.org/coca/
- spell checker: https://pypi.org/project/pyspellchecker/
- regular expressions: https://docs.python.org/3/library/re.html
- term frequency-inverse document frequency (tf-idf): https://hackernoon.com/finding-the-most-important-sentences-using-nlp-tf-idf-3065028897a3
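
For the linguistic-quality bullet above, the idea behind n-gram tools such as KenLM can be illustrated with a toy bigram model. This is a sketch under strong assumptions: the three-sentence reference corpus is invented, add-one smoothing stands in for the much better smoothing that real toolkits use, and `train_bigram_model` and `sentence_logprob` are hypothetical helper names.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams over tokenised sentences with <s>/</s> padding."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def sentence_logprob(tokens, unigrams, bigrams):
    """Log-probability of a sentence under the bigram model with add-one smoothing."""
    vocab_size = len(unigrams)  # includes the padding symbols
    padded = ["<s>"] + tokens + ["</s>"]
    logprob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        # P(cur | prev) = (count(prev, cur) + 1) / (count(prev) + V)
        logprob += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return logprob


# Invented reference corpus standing in for a large corpus such as COCA.
corpus = [
    "there is a fire in the school".split(),
    "there is a man in the hallway".split(),
    "a man is in the school".split(),
]
unigrams, bigrams = train_bigram_model(corpus)

fluent = "there is a man in the school".split()
scrambled = "school the in man a is there".split()
print(sentence_logprob(fluent, unigrams, bigrams))
print(sentence_logprob(scrambled, unigrams, bigrams))
```

The fluent word order receives the higher (less negative) log-probability, which is the signal one would use to compare a sample sentence against "normal" English.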

### pre-trained language models
- ELMo: https://arxiv.org/abs/1802.05365
- ULMFiT: https://arxiv.org/abs/1801.06146
- OpenAI Transformer: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
- BERT: https://arxiv.org/abs/1810.04805

### NLP trends
- Commonsense inference like Event2Mind (https://arxiv.org/pdf/1805.06939.pdf) or SWAG (https://arxiv.org/abs/1808.05326)

- summary of trends to be found here: http://ruder.io/10-exciting-ideas-of-2018-in-nlp/

### more research to be done into
- general summarization
- statistical parsing
- knowledge extraction: are 911 calls given in a standard or recurring format?

### Summary of current trends in NLP
- https://www.analyticsvidhya.com/blog/2017/10/essential-nlp-guide-data-scientists-top-10-nlp-tasks/ (includes a lot of interesting and helpful links)

## 1.4. DL/ML tools

- transfer learning: https://machinelearningmastery.com/transfer-learning-for-deep-learning/

# 2. Disaster datasets

## 2.1. Twitter datasets

- https://arxiv.org/abs/1605.05894
- https://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2834
- https://dl.acm.org/citation.cfm?id=2914600

## 2.2. Other datasets

- https://data.world/crowdflower/disasters-on-social-media
- collection of different datasets: https://crisisnlp.qcri.org/

## 2.3. Other github links

- Twitter: disaster classification, sentiment analysis, named entity recognition --> https://github.com/glrn/nlp-disaster-analysis
- Natural Language Understanding Bot translating unstructured text into structured data --> https://github.com/Kontikilabs/alter-nlu
- Emogram (Text Analysis for unstructured text): Acronym Resolution, Auto Correct, Key Phrase Extraction, Polarity Detection --> https://github.com/axenhammer/Emogram

# 3. Development

In [2]:
#import 
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd 
import bs4 as bs
import nltk
from nltk.tokenize import sent_tokenize # tokenizes sentences
import re
from nltk.stem import PorterStemmer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

eng_stopwords = stopwords.words('english')

# Load more data

In [42]:
file = pd.read_csv('social media disaster.csv', skip_blank_lines=True, nrows=8000, encoding = "ISO-8859-1")

In [43]:
# keep only rows that actually carry a relevance label
file_updated = file[file.choose_one.notna()]

In [44]:
file_updated

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,choose_one,choose_one:confidence,choose_one_gold,keyword,location,text,tweetid,userid;;;;;;;;;
0,778243823,TRUE,golden,156,,Relevant,1.0000,Relevant,,,Just happened a terrible car crash,1.000000e+00,;;;;;;;;;
1,778243824,TRUE,golden,152,,Relevant,1.0000,Relevant,,,Our Deeds are the Reason of this #earthquake M...,1.300000e+01,;;;;;;;;;
4,778243827,TRUE,golden,138,,Relevant,1.0000,Relevant,,,Forest fire near La Ronge Sask. Canada,1.600000e+01,;;;;;;;;;
5,778243828,TRUE,golden,140,,Relevant,1.0000,Relevant,,,All residents asked to 'shelter in place' are ...,1.700000e+01,;;;;;;;;;
7,778243832,TRUE,golden,151,,Relevant,1.0000,Relevant,,,Just got sent this photo from Ruby #Alaska as ...,1.900000e+01,;;;;;;;;;
8,778243833,TRUE,golden,143,,Relevant,1.0000,Relevant,,,#RockyFire Update => California Hwy. 20 closed...,2.000000e+01,;;;;;;;;;
12,778243836,TRUE,golden,157,,Relevant,1.0000,Relevant,,,Typhoon Soudelor kills 28 in China and Taiwan,1.200000e+01,;;;;;;;;;
13,778243837,TRUE,golden,143,,Relevant,1.0000,Relevant,,,We're shaking...It's an earthquake,2.000000e+00,;;;;;;;;;
16,778243839,TRUE,golden,136,,Relevant,0.9215,Relevant,,,There's an emergency evacuation happening now ...,8.000000e+00,;;;;;;;;;
19,778243841,TRUE,golden,147,,Relevant,0.9603,Relevant,,,Three people died from the heat wave so far,1.000000e+01,;;;;;;;;;


In [39]:
file_updated = file_updated.drop(['_unit_id', '_golden', '_unit_state', '_last_judgment_at', 'choose_one:confidence',
       'choose_one_gold', 'keyword', 'location', 'tweetid',
       'userid;;;;;;;;;'], axis=1)

In [41]:
file_updated[200:250]

Unnamed: 0,_trusted_judgments,choose_one,text
308,5.0,Not Relevant,HAPPENING NOW - HATZOLAH EMS AMBULANCE RESPOND...
309,5.0,Relevant,New Nanotech Device Will Be Able To Target And...
310,5.0,Relevant,http://t.co/FueRk0gWui Twelve feared killed in...
311,7.0,Not Relevant,@vballplaya2296 want me to send you some medi...
312,5.0,Relevant,http://t.co/X5YEUYLT1X Twelve feared killed in...
313,5.0,Relevant,2 held with heroin in ambulance http://t.co/d9...
314,6.0,Relevant,Twelve feared killed in Pakistani air ambulanc...
315,5.0,Relevant,http://t.co/B2FaSrt1tN Twelve feared killed in...
317,6.0,Not Relevant,What's the police or ambulance number in Lesot...
318,5.0,Not Relevant,@medic914 @AACE_org I am surprised we still ca...


## Transcript input

In [16]:
call1_time = "9:35:39"
call1_length = 24 #length in seconds

call1_text = ["OPERATOR: Newtown 911. What's the location of your emergency?", 
              "CALLER: Hi, Sandy Hook School. I think there is somebody shooting in here, in Sandy Hook School.",
              "OPERATOR: O.K. What makes you think that?",
              "CALLER: Because somebody's got a gun. I caught a glimpse of someone, they're running down the hallway.",
              "OPERATOR: Okay.",
              "CALLER: They are still running. They're still shooting.",
              "CALLER: Sandy Hook School, please."]

## Pre-processing the emergency call

In [43]:
operator = []
caller = []

for statement in call1_text:
    # each line starts with its speaker label; strip the label and sort by speaker
    if statement.startswith('OPERATOR:'):
        operator.append(statement.replace("OPERATOR:", "", 1))
    elif statement.startswith('CALLER:'):
        caller.append(statement.replace("CALLER:", "", 1))

tsc = ""
for part in caller:
    #remove punctuation and special characters (this also drops apostrophes: "they're" -> "theyre")
    part = re.sub('[^a-zA-Z ]', '', part)
    #all to lower case
    part = part.lower()
    #part_words = part.split()
    tsc += part
       
print(tsc)
tsc_words = tsc.split()
print(tsc_words)

 hi sandy hook school i think there is somebody shooting in here in sandy hook school because somebodys got a gun i caught a glimpse of someone theyre running down the hallway they are still running theyre still shooting sandy hook school please
['hi', 'sandy', 'hook', 'school', 'i', 'think', 'there', 'is', 'somebody', 'shooting', 'in', 'here', 'in', 'sandy', 'hook', 'school', 'because', 'somebodys', 'got', 'a', 'gun', 'i', 'caught', 'a', 'glimpse', 'of', 'someone', 'theyre', 'running', 'down', 'the', 'hallway', 'they', 'are', 'still', 'running', 'theyre', 'still', 'shooting', 'sandy', 'hook', 'school', 'please']
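
The output above shows a side effect of stripping every non-letter character: `somebody's` becomes `somebodys` and `they're` becomes `theyre`, forms that neither the tagger nor the stop-word list recognises later on. One possible remedy, sketched here on the assumption that a small hand-written table is acceptable, is to expand contractions before removing punctuation. The `CONTRACTIONS` table and the `normalize` helper are illustrative and not part of the notebook.

```python
import re

# Illustrative table; "'s" is ambiguous in general (is / has / possessive),
# so only a few unambiguous, hand-picked forms are listed.
CONTRACTIONS = {
    "they're": "they are",
    "i'm": "i am",
    "there's": "there is",
}


def normalize(text):
    """Lower-case, expand known contractions, then keep letters and spaces only."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z ]", "", text)
    return " ".join(text.split())


print(normalize("They are still running. They're still shooting."))
```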


## Correct grammar
correct grammar and spelling, something along these lines: https://pypi.org/project/pyspellchecker/
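
pyspellchecker deals with spelling rather than grammar. Its core idea, mapping an unknown token to the closest known word, can be sketched with the standard library's `difflib`. This is a stand-in under stated assumptions and not the pyspellchecker API: the tiny `VOCABULARY` is invented, `correct` is a hypothetical helper, and a real checker would use a full frequency dictionary.

```python
import difflib

# Invented mini-vocabulary; a real checker would load a large word list.
VOCABULARY = ["somebody", "shooting", "school", "hallway", "running", "glimpse"]


def correct(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if none is close enough."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print([correct(w) for w in ["shoting", "halway", "gun"]])
```

Words without a sufficiently close match (here "gun") are returned unchanged, so out-of-vocabulary terms are not silently rewritten.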

## Named entity recognition/disambiguation
- find out name of school, city, street etc.
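
Trained NER models tend to work best on cased, well-formed text, which the lower-cased transcript no longer is. As a hedged fallback, multi-word names can be looked up in a gazetteer. The sketch below assumes a hand-made `GAZETTEER` (invented for illustration) and a hypothetical `find_entities` helper; it is not a replacement for a trained NER model.

```python
# Invented gazetteer: token tuples mapped to an entity type.
GAZETTEER = {
    ("sandy", "hook", "school"): "FACILITY",
    ("newtown",): "CITY",
}
MAX_LEN = max(len(name) for name in GAZETTEER)


def find_entities(tokens):
    """Greedy longest-match lookup; returns (start_index, phrase, entity_type) triples."""
    found = []
    i = 0
    while i < len(tokens):
        for size in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in GAZETTEER:
                found.append((i, " ".join(candidate), GAZETTEER[candidate]))
                i += size
                break
        else:
            i += 1  # no gazetteer entry starts at this token
    return found


tokens = ("hi sandy hook school i think there is somebody shooting "
          "in here in sandy hook school").split()
print(find_entities(tokens))
```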

## Word embeddings

## Sentiment analysis
- sentiment analysis
    - see the article at https://www.analyticsvidhya.com/blog/2017/01/sentiment-analysis-of-twitter-posts-on-chennai-floods-using-python/, where sentiment analysis was performed on a Chennai flood dataset

## Part of Speech Tagging
For meaning of tags check: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

To do: implement a more advanced algorithm that can also identify multi-word expressions such as locations, not only single words.

In [41]:
#part of speech tagging - standard NLTK probably not sufficient
token_tag = pos_tag(tsc_words)
print(token_tag)

[('hi', 'NN'), ('sandy', 'NN'), ('hook', 'NN'), ('school', 'NN'), ('i', 'NN'), ('think', 'VBP'), ('there', 'EX'), ('is', 'VBZ'), ('somebody', 'NN'), ('shooting', 'VBG'), ('in', 'IN'), ('here', 'RB'), ('in', 'IN'), ('sandy', 'JJ'), ('hook', 'NN'), ('school', 'NN'), ('because', 'IN'), ('somebodys', 'NN'), ('got', 'VBD'), ('a', 'DT'), ('gun', 'NN'), ('i', 'NN'), ('caught', 'VBD'), ('a', 'DT'), ('glimpse', 'NN'), ('of', 'IN'), ('someone', 'NN'), ('theyre', 'NN'), ('running', 'VBG'), ('down', 'RP'), ('the', 'DT'), ('hallway', 'NN'), ('they', 'PRP'), ('are', 'VBP'), ('still', 'RB'), ('running', 'VBG'), ('theyre', 'NN'), ('still', 'RB'), ('shooting', 'VBG'), ('sandy', 'JJ'), ('hook', 'NN'), ('school', 'NN'), ('please', 'NN')]


In [None]:
#output: counting expressions (like Sandy Hook School or shooting)
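
The counting step noted in the empty cell above could look like the following sketch. It counts n-grams with `collections.Counter`; the token list is copied from the pre-processing output, and `count_ngrams` is a hypothetical helper name.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count all contiguous n-grams (as space-joined strings) in a token list."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Caller tokens as printed by the pre-processing cell.
tsc_words = ("hi sandy hook school i think there is somebody shooting in here in "
             "sandy hook school because somebodys got a gun i caught a glimpse of "
             "someone theyre running down the hallway they are still running theyre "
             "still shooting sandy hook school please").split()

print(count_ngrams(tsc_words, 3).most_common(1))
print(count_ngrams(tsc_words, 1)["shooting"])
```

The most frequent trigram is "sandy hook school" (3 occurrences) and "shooting" occurs twice, which matches the expressions named in the cell comment.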

## Lemmatization

In [38]:
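def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # fall back to noun, which is also the lemmatizer's default
        return wordnet.NOUN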

In [39]:
wnl = WordNetLemmatizer()

wnl_stems = []
for pair in token_tag:
    res = wnl.lemmatize(pair[0],pos=get_wordnet_pos(pair[1]))
    wnl_stems.append(res)

print(' '.join(wnl_stems))

hi sandy hook school i think there be somebody shoot in here in sandy hook school because somebody get a gun i catch a glimpse of someone theyre run down the hallway they be still run theyre still shoot sandy hook school please


## Stopwords

In [42]:
# build the stop-word set once instead of reloading the list for every token
stopword_set = set(eng_stopwords)
tsc_wo_stopwords = [w for w in tsc_words if w not in stopword_set]
removed_stopwords = [w for w in tsc_words if w in stopword_set]

print('REVIEW WITHOUT STOPWORDS:')
print(' '.join(tsc_wo_stopwords))
print()
print('Stop words removed', removed_stopwords)
print()
print('NUMBER OF STOPWORDS REMOVED:',len(removed_stopwords))

REVIEW WITHOUT STOPWORDS:
hi sandy hook school think somebody shooting sandy hook school somebodys got gun caught glimpse someone theyre running hallway still running theyre still shooting sandy hook school please

Stop words removed ['i', 'there', 'is', 'in', 'here', 'in', 'because', 'a', 'i', 'a', 'of', 'down', 'the', 'they', 'are']

NUMBER OF STOPWORDS REMOVED: 15


# ---------------------------------------------------------

### Here follows a summary of what we extracted from the text (summary, keywords, etc.) and how this influences the priority