# disaster_response_messages

The file contains over 11,000 tweets associated with disaster keywords like “crash”, “quarantine”, and “bush fires” as well as the location and keyword itself.

Each tweet was then manually classified by whether it referred to a real disaster event or not (for example, a joke using the word, a movie review, or something otherwise non-disastrous).

The data structure was inherited from Disasters on social media (https://www.figure-eight.com/data-for-everyone/).

In [64]:
import pandas as pd
import numpy as np
import re
import sklearn
import matplotlib.pyplot as plt
import nltk
pd.options.display.min_rows=100


In [65]:
df=pd.read_csv('data/tweets.csv')

df.set_index('id',inplace=True)
df.keyword.value_counts()

thunderstorm            93
flattened               88
mass%20murder           86
stretcher               86
drowning                83
drown                   83
sirens                  83
engulfed                82
fear                    80
obliterate              80
derailment              79
electrocute             77
collision               77
hostage                 76
deluge                  76
derailed                76
deaths                  76
attack                  74
sunk                    74
fatalities              74
airplane%20accident     74
traumatised             74
inundation              72
destroy                 72
damage                  72
crash                   71
inundated               71
death                   71
body%20bag              71
demolished              70
                        ..
survived                30
suicide%20bomber        29
structural%20failure    29
fatal                   29
blaze                   28
thunder                 27
f

In [66]:
# Cleaning Data
# text preprocessing
df['text'].iloc[3]

'Arsonist sets cars ablaze at dealership https://t.co/0gL7NUCPlb https://t.co/u1CcBhOWh9'

In [67]:

df['text']=df['text'].str.strip()
df['text']=df['text'].str.lower()



In [68]:
# Expand the Contractions
# To expand the contraction in English such as we'll -> we will or we shouldn't've -> we should not have.
## !pip install contractions

# I was not able to install contractions
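
Because the contractions package could not be installed, one possible fallback is a small hand-written lookup table. The sketch below is only an illustration: the `CONTRACTIONS` table and the `expand_contractions` name are hypothetical, the table covers just a handful of forms, and ambiguous cases such as "it's" (it is / it has) are resolved arbitrarily. It assumes lower-cased text and has to run before punctuation is stripped, since the apostrophe is what identifies a contraction.

```python
import re

# Hypothetical, deliberately small lookup table; extend it as needed.
CONTRACTIONS = {
    "shouldn't've": "should not have",
    "can't": "cannot",
    "won't": "will not",
    "we'll": "we will",
    "i'm": "i am",
    "it's": "it is",  # could also mean "it has"; this sketch picks one reading
}

# Longest keys first, so a long form is tried before any shorter overlapping key.
_CONTRACTION_PATTERN = re.compile(
    r"(?<!\w)("
    + "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")(?!\w)"
)

def expand_contractions(text):
    """Replace known contractions in lower-cased text with their expansions."""
    return _CONTRACTION_PATTERN.sub(lambda m: CONTRACTIONS[m.group()], text)

print(expand_contractions("we'll see, but we shouldn't've gone"))
```

Applied to the column, this would be `df['text'].apply(expand_contractions)`.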

### Remove Noise:
Text data could include various unnecessary characters or punctuation such as URLs, HTML tags, non-ASCII characters, or other special characters (symbols, emojis, and other graphic characters).

In [69]:
# Removing url
#df['text'].str.replace(r'\bhttp://.*\b')
display(df['text'].iloc[3])

'arsonist sets cars ablaze at dealership https://t.co/0gl7nucplb https://t.co/u1ccbhowh9'

In [70]:
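def remove_url(text):
    # Note: '.*' is greedy, so this removes everything from the first URL up to the last word boundary in the tweet.
    return re.sub(r'https?://.*\b','',text)
df['text']=df['text'].apply(remove_url)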
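
The pattern `https?://.*\b` also swallows any words that follow a URL, which happens to be harmless for tweets that end in links. As a hedged alternative, the sketch below matches only the URL token itself; `remove_url_narrow` and the sample tweet are made up for illustration.

```python
import re

# Match "http://" or "https://" followed by non-whitespace characters only.
URL_PATTERN = re.compile(r"https?://\S+")

def remove_url_narrow(text):
    """Remove each URL token but keep the words around it."""
    return URL_PATTERN.sub("", text)

sample = "fire downtown https://t.co/abc123 stay safe"

greedy = re.sub(r"https?://.*\b", "", sample)  # pattern used in the cell above
narrow = remove_url_narrow(sample)

print(repr(greedy))
print(repr(narrow))
```

The narrow version leaves a double space where the URL used to be; whitespace-based tokenization later on absorbs it.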

In [71]:
def remove_html(text):
    html = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    return re.sub(html, "", text)
df['text']=df['text'].apply(lambda x : remove_html(x))


In [72]:
def remove_non_ascii(text):
    return re.sub(r'[^\x00-\x7f]',r'', text) # or ''.join([x for x in text if x in string.printable]) 
df['text']=df['text'].apply(lambda x: remove_non_ascii(x))

In [73]:
df['text'].iloc[3]

'arsonist sets cars ablaze at dealership '

In [74]:
# Some tweets have emojis. They should be removed from the text.
# e.g.
df['text'].loc[17]

'rengoku sets my heart ablaze p.s. i missed this style of coloring i do so here it is c: # '

In [75]:
# The following function removes emojis from text; use 'apply' to run it on each tweet.
# Note: remove_non_ascii above has already stripped every non-ASCII character, so on this column it changes nothing.
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # box drawing through misc symbols and arrows
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"  # zero-width joiner
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # variation selector-16
                               u"\u3030"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [76]:
df['text']=df['text'].apply(lambda x: remove_emoji(x))

In [77]:
df['text'].loc[17]

'rengoku sets my heart ablaze p.s. i missed this style of coloring i do so here it is c: # '

In [78]:
df['text'].loc[11366]

'i feel directly attacked  i consider moonbin  jinjin as my bias and im currently wrecked by rocky i hate this'

In [79]:
import string

def remove_punc(text):
    # Delete every character listed in string.punctuation
    return text.translate(str.maketrans('','',string.punctuation))
df['text']=df['text'].apply(remove_punc)
df['text']

id
0        communal violence in bhainsa telangana stones ...
1        telangana section 144 has been imposed in bhai...
2                 arsonist sets cars ablaze at dealership 
3                 arsonist sets cars ablaze at dealership 
4        lord jesus your love brings freedom and pardon...
5        if this child was chinese this tweet would hav...
6        several houses have been set ablaze in ngemsib...
7        asansol a bjp office in salanpur village was s...
8        national security minister kan dapaahs side ch...
9        this creature whos soul is no longer clarent b...
10       images showing the havoc caused by the cameroo...
11       social media went bananas after chuba hubbard ...
12       hausa youths set area office of apapaiganmu lo...
13       under mamatabanerjee political violence  vanda...
14                   amen set the whole system ablaze man 
15       images showing the havoc caused by the cameroo...
16       no cows today but our local factory is sadly

## Replace typos, slang, acronyms, and informal abbreviations:
- Replace Unicode characters with their ASCII equivalents (instead of removing them)
- Replace entity references with their actual symbols instead of removing them as HTML tags
- Replace typos, slang, acronyms, and informal abbreviations; the right mapping depends on the situation or main topic of the NLP task, such as finance or medicine
- List all hashtags/usernames, then replace them with equivalent words
- Replace emoticons/emojis with their word meaning, such as ":)" with "smile"
- Spelling correction
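
The first two points can be sketched with the standard library alone. This is a minimal illustration, not part of the notebook's pipeline: `replace_entities_and_accents` is a hypothetical helper, and characters with no ASCII decomposition (emojis, "€") are still dropped rather than replaced.

```python
import html
import unicodedata

def replace_entities_and_accents(text):
    """Decode HTML entities, then fold accented letters to plain ASCII."""
    # "&amp;" -> "&", "&gt;" -> ">"
    text = html.unescape(text)
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping non-ASCII bytes removes the combining marks but keeps the base letters.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(replace_entities_and_accents("caf\u00e9 &amp; bar &gt; 5km"))
```

To have any effect, a step like this would need to run before `remove_html` and `remove_non_ascii`, which delete the same characters outright.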

In [80]:
def other_clean(text):
        
        # Typos, slang and other
        sample_typos_slang = {
                                "w/e": "whatever",
                                "usagov": "usa government",
                                "recentlu": "recently",
                                "ph0tos": "photos",
                                "amirite": "am i right",
                                "exp0sed": "exposed",
                                "<3": "love",
                                "luv": "love",
                                "amageddon": "armageddon",
                                "trfc": "traffic",
                                "16yr": "16 year"
                                }

        # Acronyms
        sample_acronyms =  { 
                            "mh370": "malaysia airlines flight 370",
                            "okwx": "oklahoma city weather",
                            "arwx": "arkansas weather",    
                            "gawx": "georgia weather",  
                            "scwx": "south carolina weather",  
                            "cawx": "california weather",
                            "tnwx": "tennessee weather",
                            "azwx": "arizona weather",  
                            "alwx": "alabama weather",
                            "usnwsgov": "united states national weather service",
                            "2mw": "tomorrow"
                            }

        
        # Some common abbreviations 
        sample_abbr = {
                        "$" : " dollar ",
                        "€" : " euro ",
                        "4ao" : "for adults only",
                        "a.m" : "before midday",
                        "a3" : "anytime anywhere anyplace",
                        "aamof" : "as a matter of fact",
                        "acct" : "account",
                        "adih" : "another day in hell",
                        "afaic" : "as far as i am concerned",
                        "afaict" : "as far as i can tell",
                        "afaik" : "as far as i know",
                        "afair" : "as far as i remember",
                        "afk" : "away from keyboard",
                        "app" : "application",
                        "approx" : "approximately",
                        "apps" : "applications",
                        "asap" : "as soon as possible",
                        "asl" : "age, sex, location",
                        "atk" : "at the keyboard",
                        "ave." : "avenue",
                        "aymm" : "are you my mother",
                        "ayor" : "at your own risk", 
                        "b&b" : "bed and breakfast",
                        "b+b" : "bed and breakfast",
                        "b.c" : "before christ",
                        "b2b" : "business to business",
                        "b2c" : "business to customer",
                        "b4" : "before",
                        "b4n" : "bye for now",
                        "b@u" : "back at you",
                        "bae" : "before anyone else",
                        "bak" : "back at keyboard",
                        "bbbg" : "bye bye be good",
                        "bbc" : "british broadcasting corporation",
                        "bbias" : "be back in a second",
                        "bbl" : "be back later",
                        "bbs" : "be back soon",
                        "be4" : "before",
                        "bfn" : "bye for now",
                        "blvd" : "boulevard",
                        "bout" : "about",
                        "brb" : "be right back",
                        "bros" : "brothers",
                        "brt" : "be right there",
                        "bsaaw" : "big smile and a wink",
                        "btw" : "by the way",
                        "bwl" : "bursting with laughter",
                        "c/o" : "care of",
                        "cet" : "central european time",
                        "cf" : "compare",
                        "cia" : "central intelligence agency",
                        "csl" : "can not stop laughing",
                        "cu" : "see you",
                        "cul8r" : "see you later",
                        "cv" : "curriculum vitae",
                        "cwot" : "complete waste of time",
                        "cya" : "see you",
                        "cyt" : "see you tomorrow",
                        "dae" : "does anyone else",
                        "dbmib" : "do not bother me i am busy",
                        "diy" : "do it yourself",
                        "dm" : "direct message",
                        "dwh" : "during work hours",
                        "e123" : "easy as one two three",
                        "eet" : "eastern european time",
                        "eg" : "example",
                        "embm" : "early morning business meeting",
                        "encl" : "enclosed",
                        "encl." : "enclosed",
                        "etc" : "and so on",
                        "faq" : "frequently asked questions",
                        "fawc" : "for anyone who cares",
                        "fb" : "facebook",
                        "fc" : "fingers crossed",
                        "fig" : "figure",
                        "fimh" : "forever in my heart", 
                        "ft." : "feet",
                        "ft" : "featuring",
                        "ftl" : "for the loss",
                        "ftw" : "for the win",
                        "fwiw" : "for what it is worth",
                        "fyi" : "for your information",
                        "g9" : "genius",
                        "gahoy" : "get a hold of yourself",
                        "gal" : "get a life",
                        "gcse" : "general certificate of secondary education",
                        "gfn" : "gone for now",
                        "gg" : "good game",
                        "gl" : "good luck",
                        "glhf" : "good luck have fun",
                        "gmt" : "greenwich mean time",
                        "gmta" : "great minds think alike",
                        "gn" : "good night",
                        "g.o.a.t" : "greatest of all time",
                        "goat" : "greatest of all time",
                        "goi" : "get over it",
                        "gps" : "global positioning system",
                        "gr8" : "great",
                        "gratz" : "congratulations",
                        "gyal" : "girl",
                        "h&c" : "hot and cold",
                        "hp" : "horsepower",
                        "hr" : "hour",
                        "hrh" : "his royal highness",
                        "ht" : "height",
                        "ibrb" : "i will be right back",
                        "ic" : "i see",
                        "icq" : "i seek you",
                        "icymi" : "in case you missed it",
                        "idc" : "i do not care",
                        "idgadf" : "i do not give a damn fuck",
                        "idgaf" : "i do not give a fuck",
                        "idk" : "i do not know",
                        "ie" : "that is",
                        "i.e" : "that is",
                        "ifyp" : "i feel your pain",
                        "IG" : "instagram",
                        "iirc" : "if i remember correctly",
                        "ilu" : "i love you",
                        "ily" : "i love you",
                        "imho" : "in my humble opinion",
                        "imo" : "in my opinion",
                        "imu" : "i miss you",
                        "iow" : "in other words",
                        "irl" : "in real life",
                        "j4f" : "just for fun",
                        "jic" : "just in case",
                        "jk" : "just kidding",
                        "jsyk" : "just so you know",
                        "l8r" : "later",
                        "lb" : "pound",
                        "lbs" : "pounds",
                        "ldr" : "long distance relationship",
                        "lmao" : "laugh my ass off",
                        "lmfao" : "laugh my fucking ass off",
                        "lol" : "laughing out loud",
                        "ltd" : "limited",
                        "ltns" : "long time no see",
                        "m8" : "mate",
                        "mf" : "motherfucker",
                        "mfs" : "motherfuckers",
                        "mfw" : "my face when",
                        "mofo" : "motherfucker",
                        "mph" : "miles per hour",
                        "mr" : "mister",
                        "mrw" : "my reaction when",
                        "ms" : "miss",
                        "mte" : "my thoughts exactly",
                        "nagi" : "not a good idea",
                        "nbc" : "national broadcasting company",
                        "nbd" : "not big deal",
                        "nfs" : "not for sale",
                        "ngl" : "not going to lie",
                        "nhs" : "national health service",
                        "nrn" : "no reply necessary",
                        "nsfl" : "not safe for life",
                        "nsfw" : "not safe for work",
                        "nth" : "nice to have",
                        "nvr" : "never",
                        "nyc" : "new york city",
                        "oc" : "original content",
                        "og" : "original",
                        "ohp" : "overhead projector",
                        "oic" : "oh i see",
                        "omdb" : "over my dead body",
                        "omg" : "oh my god",
                        "omw" : "on my way",
                        "p.a" : "per annum",
                        "p.m" : "after midday",
                        "pm" : "prime minister",
                        "poc" : "people of color",
                        "pov" : "point of view",
                        "pp" : "pages",
                        "ppl" : "people",
                        "prw" : "parents are watching",
                        "ps" : "postscript",
                        "pt" : "point",
                        "ptb" : "please text back",
                        "pto" : "please turn over",
                        "qpsa" : "what happens", #"que pasa",
                        "ratchet" : "rude",
                        "rbtl" : "read between the lines",
                        "rlrt" : "real life retweet", 
                        "rofl" : "rolling on the floor laughing",
                        "roflol" : "rolling on the floor laughing out loud",
                        "rotflmao" : "rolling on the floor laughing my ass off",
                        "rt" : "retweet",
                        "ruok" : "are you ok",
                        "sfw" : "safe for work",
                        "sk8" : "skate",
                        "smh" : "shake my head",
                        "sq" : "square",
                        "srsly" : "seriously", 
                        "ssdd" : "same stuff different day",
                        "tbh" : "to be honest",
                        "tbs" : "tablespooful",
                        "tbsp" : "tablespooful",
                        "tfw" : "that feeling when",
                        "thks" : "thank you",
                        "tho" : "though",
                        "thx" : "thank you",
                        "tia" : "thanks in advance",
                        "til" : "today i learned",
                        "tl;dr" : "too long i did not read",
                        "tldr" : "too long i did not read",
                        "tmb" : "tweet me back",
                        "tntl" : "trying not to laugh",
                        "ttyl" : "talk to you later",
                        "u" : "you",
                        "u2" : "you too",
                        "u4e" : "yours for ever",
                        "utc" : "coordinated universal time",
                        "w/" : "with",
                        "w/o" : "without",
                        "w8" : "wait",
                        "wassup" : "what is up",
                        "wb" : "welcome back",
                        "wtf" : "what the fuck",
                        "wtg" : "way to go",
                        "wtpa" : "where the party at",
                        "wuf" : "where are you from",
                        "wuzup" : "what is up",
                        "wywh" : "wish you were here",
                        "yd" : "yard",
                        "ygtr" : "you got that right",
                        "ynk" : "you never know",
                        "zzz" : "sleeping bored and tired"
                        }
            
        sample_typos_slang_pattern = re.compile(r'(?<!\w)(' + '|'.join(re.escape(key) for key in sample_typos_slang.keys()) + r')(?!\w)')
        sample_acronyms_pattern = re.compile(r'(?<!\w)(' + '|'.join(re.escape(key) for key in sample_acronyms.keys()) + r')(?!\w)')
        sample_abbr_pattern = re.compile(r'(?<!\w)(' + '|'.join(re.escape(key) for key in sample_abbr.keys()) + r')(?!\w)')
        
        text = sample_typos_slang_pattern.sub(lambda x: sample_typos_slang[x.group()], text)
        text = sample_acronyms_pattern.sub(lambda x: sample_acronyms[x.group()], text)
        text = sample_abbr_pattern.sub(lambda x: sample_abbr[x.group()], text)
        
        return text
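
Two properties of this lookup approach are worth checking on a small scale: the patterns can be compiled once instead of on every call, and keys that contain punctuation (such as "w/e", "<3" or "$") can only match if punctuation has not already been stripped. The sketch below uses a hypothetical three-entry table, `SLANG`, in place of the full dictionaries.

```python
import re

# Hypothetical three-entry table standing in for the full dictionaries.
SLANG = {"w/e": "whatever", "luv": "love", "b4": "before"}

# Compiled once at import time rather than inside the function.
SLANG_PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(k) for k in SLANG) + r")(?!\w)"
)

def replace_slang(text):
    """Replace whole-token slang keys with their expansions."""
    return SLANG_PATTERN.sub(lambda m: SLANG[m.group()], text)

raw = "w/e luv, see you b4 noon"
stripped = raw.replace("/", "").replace(",", "")  # mimics punctuation removal

print(replace_slang(raw))       # all three keys are found
print(replace_slang(stripped))  # "w/e" has become "we" and no longer matches
```

If the punctuation-bearing keys matter, the replacement step would have to run before `remove_punc`.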

In [81]:

df['text']=df['text'].apply(lambda x : other_clean(x))

In [82]:
# removing duplicate rows
df.drop_duplicates(subset=['text'], keep='first',inplace=True)
df.head()

id,keyword,location,text,target
0,ablaze,,communal violence in bhainsa telangana stones ...,1
1,ablaze,,telangana section 144 has been imposed in bhai...,1
2,ablaze,New York City,arsonist sets cars ablaze at dealership,1
4,ablaze,,lord jesus your love brings freedom and pardon...,0
5,ablaze,OC,if this child was chinese this tweet would hav...,0


### Spelling Correction
You can use textblob.TextBlob to correct typos, but use it carefully because it might change the meaning of the text.

from textblob import TextBlob

e.g.:

print("Test: ", TextBlob("sleapy and tehre is no plaxe I'm gioong to.").correct())

## Text Preprocessing:

In [83]:
# Tokenization:
from nltk import word_tokenize
df['tokenized']=df['text'].apply(word_tokenize)
df.head()

id,keyword,location,text,target,tokenized
0,ablaze,,communal violence in bhainsa telangana stones ...,1,"[communal, violence, in, bhainsa, telangana, s..."
1,ablaze,,telangana section 144 has been imposed in bhai...,1,"[telangana, section, 144, has, been, imposed, ..."
2,ablaze,New York City,arsonist sets cars ablaze at dealership,1,"[arsonist, sets, cars, ablaze, at, dealership]"
4,ablaze,,lord jesus your love brings freedom and pardon...,0,"[lord, jesus, your, love, brings, freedom, and..."
5,ablaze,OC,if this child was chinese this tweet would hav...,0,"[if, this, child, was, chinese, this, tweet, w..."


In [84]:
# Removing stop words:
from nltk.corpus import stopwords

# A set gives fast membership tests; the corpus name must be lower-case 'english'.
stop = set(stopwords.words('english'))
df['stopwords_removed'] = df['tokenized'].apply(lambda x: [word for word in x if word not in stop])
df.head()

id,keyword,location,text,target,tokenized,stopwords_removed
0,ablaze,,communal violence in bhainsa telangana stones ...,1,"[communal, violence, in, bhainsa, telangana, s...","[communal, violence, bhainsa, telangana, stone..."
1,ablaze,,telangana section 144 has been imposed in bhai...,1,"[telangana, section, 144, has, been, imposed, ...","[telangana, section, 144, imposed, bhainsa, ja..."
2,ablaze,New York City,arsonist sets cars ablaze at dealership,1,"[arsonist, sets, cars, ablaze, at, dealership]","[arsonist, sets, cars, ablaze, dealership]"
4,ablaze,,lord jesus your love brings freedom and pardon...,0,"[lord, jesus, your, love, brings, freedom, and...","[lord, jesus, love, brings, freedom, pardon, f..."
5,ablaze,OC,if this child was chinese this tweet would hav...,0,"[if, this, child, was, chinese, this, tweet, w...","[child, chinese, tweet, would, gone, viral, so..."


In [85]:
# Stemming:
from nltk.stem import PorterStemmer
stemmer=PorterStemmer()  # build the stemmer once and reuse it for every row
def porter_stemmer(tokens):
    return [stemmer.stem(word) for word in tokens]


In [86]:
df['PorterStemmer']=df['stopwords_removed'].apply(porter_stemmer)
df

id,keyword,location,text,target,tokenized,stopwords_removed,PorterStemmer
0,ablaze,,communal violence in bhainsa telangana stones ...,1,"[communal, violence, in, bhainsa, telangana, s...","[communal, violence, bhainsa, telangana, stone...","[commun, violenc, bhainsa, telangana, stone, p..."
1,ablaze,,telangana section 144 has been imposed in bhai...,1,"[telangana, section, 144, has, been, imposed, ...","[telangana, section, 144, imposed, bhainsa, ja...","[telangana, section, 144, impos, bhainsa, janu..."
2,ablaze,New York City,arsonist sets cars ablaze at dealership,1,"[arsonist, sets, cars, ablaze, at, dealership]","[arsonist, sets, cars, ablaze, dealership]","[arsonist, set, car, ablaz, dealership]"
4,ablaze,,lord jesus your love brings freedom and pardon...,0,"[lord, jesus, your, love, brings, freedom, and...","[lord, jesus, love, brings, freedom, pardon, f...","[lord, jesu, love, bring, freedom, pardon, fil..."
5,ablaze,OC,if this child was chinese this tweet would hav...,0,"[if, this, child, was, chinese, this, tweet, w...","[child, chinese, tweet, would, gone, viral, so...","[child, chines, tweet, would, gone, viral, soc..."
6,ablaze,"London, England",several houses have been set ablaze in ngemsib...,1,"[several, houses, have, been, set, ablaze, in,...","[several, houses, set, ablaze, ngemsibaa, vill...","[sever, hous, set, ablaz, ngemsibaa, villag, o..."
7,ablaze,Bharat,asansol a bjp office in salanpur village was s...,1,"[asansol, a, bjp, office, in, salanpur, villag...","[asansol, bjp, office, salanpur, village, set,...","[asansol, bjp, offic, salanpur, villag, set, a..."
8,ablaze,"Accra, Ghana",national security minister kan dapaahs side ch...,0,"[national, security, minister, kan, dapaahs, s...","[national, security, minister, kan, dapaahs, s...","[nation, secur, minist, kan, dapaah, side, chi..."
9,ablaze,Searching,this creature whos soul is no longer clarent b...,0,"[this, creature, whos, soul, is, no, longer, c...","[creature, whos, soul, longer, clarent, blue, ...","[creatur, who, soul, longer, clarent, blue, ab..."
10,ablaze,,images showing the havoc caused by the cameroo...,1,"[images, showing, the, havoc, caused, by, the,...","[images, showing, havoc, caused, cameroon, mil...","[imag, show, havoc, caus, cameroon, militari, ..."


### Part of Speech (POS) tagging

In [118]:
from nltk.corpus import wordnet

def defualt_pos_tagger (text):
    
    #tags=[nltk.pos_tag(word) for word in text]
    tags=nltk.pos_tag(text)
    return tags

df['defualt_postag']=df['stopwords_removed'].apply(defualt_pos_tagger)

In [115]:
# brown supplies the tagged training sentences used below (needs nltk.download('brown'))
from nltk.corpus import brown, wordnet
wordnet_map = {"N":wordnet.NOUN, 
               "V":wordnet.VERB, 
               "J":wordnet.ADJ, 
               "R":wordnet.ADV
              }
    
train_sents = brown.tagged_sents(categories='news')
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)

#def pos_tag_wordnet(text, pos_tag_type='pos_tag'):
def pos_tag_wordnet(text):
    pos_tagged_text = t2.tag(text)
    
    # map the first letter of each POS tag to a wordnet POS, defaulting to noun
    pos_tagged_text = [(word, wordnet_map.get(pos_tag[0], wordnet.NOUN)) for (word, pos_tag) in pos_tagged_text]
    return pos_tagged_text

In [116]:
df['combined_postag_wnet']=df['stopwords_removed'].apply(pos_tag_wordnet)

### Lemmatization

In [108]:
from nltk.stem import WordNetLemmatizer
def lemmatize(text):
    lemmatizer=WordNetLemmatizer()
    lemma=[lemmatizer.lemmatize(word,tag) for word,tag in text]
    return lemma

#### Lemmatization can be done with or without considering pos:

In [112]:
# lemmatization without considering pos:
lemmatizer=WordNetLemmatizer()

df['lemmatized_word_without_pos']=df['stopwords_removed'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

df['lemmatized_word_without_pos']=df['lemmatized_word_without_pos'].apply(lambda x: [word for word in x if word not in stop])

In [126]:
# lemmatization considering pos:
lemmatizer=WordNetLemmatizer()

df['lemmatized_word_with_pos']=df['combined_postag_wnet'].apply(lambda x: lemmatize(x))

df['lemmatized_word_with_pos']=df['lemmatized_word_with_pos'].apply(lambda x: [word for word in x if word not in stop])
df['lemmatized_text']=[' '.join(map(str,i)) for i in df['lemmatized_word_with_pos']]

In [127]:
df

id,keyword,location,text,target,tokenized,stopwords_removed,PorterStemmer,combined_postag_wnet,defualt_postag,lemmatized_word_without_pos,lemmatized_word_with_pos,lemmatized_text
0,ablaze,,communal violence in bhainsa telangana stones ...,1,"[communal, violence, in, bhainsa, telangana, s...","[communal, violence, bhainsa, telangana, stone...","[commun, violenc, bhainsa, telangana, stone, p...","[(communal, n), (violence, n), (bhainsa, n), (...","[(communal, JJ), (violence, NN), (bhainsa, NN)...","[communal, violence, bhainsa, telangana, stone...","[communal, violence, bhainsa, telangana, stone...",communal violence bhainsa telangana stone pelt...
1,ablaze,,telangana section 144 has been imposed in bhai...,1,"[telangana, section, 144, has, been, imposed, ...","[telangana, section, 144, imposed, bhainsa, ja...","[telangana, section, 144, impos, bhainsa, janu...","[(telangana, n), (section, n), (144, n), (impo...","[(telangana, JJ), (section, NN), (144, CD), (i...","[telangana, section, 144, imposed, bhainsa, ja...","[telangana, section, 144, impose, bhainsa, jan...",telangana section 144 impose bhainsa january 1...
2,ablaze,New York City,arsonist sets cars ablaze at dealership,1,"[arsonist, sets, cars, ablaze, at, dealership]","[arsonist, sets, cars, ablaze, dealership]","[arsonist, set, car, ablaz, dealership]","[(arsonist, n), (sets, v), (cars, n), (ablaze,...","[(arsonist, JJ), (sets, NNS), (cars, NNS), (ab...","[arsonist, set, car, ablaze, dealership]","[arsonist, set, car, ablaze, dealership]",arsonist set car ablaze dealership
4,ablaze,,lord jesus your love brings freedom and pardon...,0,"[lord, jesus, your, love, brings, freedom, and...","[lord, jesus, love, brings, freedom, pardon, f...","[lord, jesu, love, bring, freedom, pardon, fil...","[(lord, n), (jesus, n), (love, v), (brings, v)...","[(lord, NN), (jesus, NN), (love, VBP), (brings...","[lord, jesus, love, brings, freedom, pardon, f...","[lord, jesus, love, bring, freedom, pardon, fi...",lord jesus love bring freedom pardon fill holy...
5,ablaze,OC,if this child was chinese this tweet would hav...,0,"[if, this, child, was, chinese, this, tweet, w...","[child, chinese, tweet, would, gone, viral, so...","[child, chines, tweet, would, gone, viral, soc...","[(child, n), (chinese, n), (tweet, n), (would,...","[(child, NN), (chinese, JJ), (tweet, NN), (wou...","[child, chinese, tweet, would, gone, viral, so...","[child, chinese, tweet, would, go, viral, soci...",child chinese tweet would go viral social medi...
6,ablaze,"London, England",several houses have been set ablaze in ngemsib...,1,"[several, houses, have, been, set, ablaze, in,...","[several, houses, set, ablaze, ngemsibaa, vill...","[sever, hous, set, ablaz, ngemsibaa, villag, o...","[(several, n), (houses, n), (set, v), (ablaze,...","[(several, JJ), (houses, NNS), (set, VBD), (ab...","[several, house, set, ablaze, ngemsibaa, villa...","[several, house, set, ablaze, ngemsibaa, villa...",several house set ablaze ngemsibaa village oku...
7,ablaze,Bharat,asansol a bjp office in salanpur village was s...,1,"[asansol, a, bjp, office, in, salanpur, villag...","[asansol, bjp, office, salanpur, village, set,...","[asansol, bjp, offic, salanpur, villag, set, a...","[(asansol, n), (bjp, n), (office, n), (salanpu...","[(asansol, NNS), (bjp, JJ), (office, NN), (sal...","[asansol, bjp, office, salanpur, village, set,...","[asansol, bjp, office, salanpur, village, set,...",asansol bjp office salanpur village set ablaze...
8,ablaze,"Accra, Ghana",national security minister kan dapaahs side ch...,0,"[national, security, minister, kan, dapaahs, s...","[national, security, minister, kan, dapaahs, s...","[nation, secur, minist, kan, dapaah, side, chi...","[(national, a), (security, n), (minister, n), ...","[(national, JJ), (security, NN), (minister, NN...","[national, security, minister, kan, dapaahs, s...","[national, security, minister, kan, dapaahs, s...",national security minister kan dapaahs side ch...
9,ablaze,Searching,this creature whos soul is no longer clarent b...,0,"[this, creature, whos, soul, is, no, longer, c...","[creature, whos, soul, longer, clarent, blue, ...","[creatur, who, soul, longer, clarent, blue, ab...","[(creature, n), (whos, n), (soul, n), (longer,...","[(creature, NN), (whos, NN), (soul, NN), (long...","[creature, soul, longer, clarent, blue, ablaze...","[creature, soul, longer, clarent, blue, ablaze...",creature soul longer clarent blue ablaze thing...
10,ablaze,,images showing the havoc caused by the cameroo...,1,"[images, showing, the, havoc, caused, by, the,...","[images, showing, havoc, caused, cameroon, mil...","[imag, show, havoc, caus, cameroon, militari, ...","[(images, n), (showing, v), (havoc, n), (cause...","[(images, NNS), (showing, VBG), (havoc, NN), (...","[image, showing, havoc, caused, cameroon, mili...","[image, show, havoc, cause, cameroon, military...",image show havoc cause cameroon military torch...


In [13]:
# The sample is unbalanced: only 17.8% of the tweets are related to disasters.
df['target'].value_counts()/len(df)

0    0.821932
1    0.178068
Name: target, dtype: float64
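With roughly 18% positives, a purely random split can leave the train and test sets with slightly different class ratios. A minimal sketch of a stratified split, assuming toy labels (the 20/80 mix and the placeholder features below are made up for illustration, not taken from the tweets):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a similar imbalance: 20% positives (assumed figures).
y = np.array([1] * 20 + [0] * 80)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# stratify=y keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print("positive rate in train:", y_tr.mean())
print("positive rate in test: ", y_te.mean())
```

On the real data the same idea would be `stratify=df['target']` in the `train_test_split` call.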

In [35]:
# Defining X,y
from sklearn.model_selection import train_test_split
text_train, text_test, y_train, y_test= train_test_split(df['text'],df['target'],random_state=42)

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer().fit(text_train)
X_train_vect = vect.transform(text_train)
print("X_train:\n{}".format(repr(X_train_vect)))

X_train:
<8200x18286 sparse matrix of type '<class 'numpy.int64'>'
	with 121848 stored elements in Compressed Sparse Row format>


In [34]:
feature_names = vect.get_feature_names()
print("Number of features: {}".format(len(feature_names)))
print("Every 2000th feature:\n{}".format(feature_names[::2000]))

# As can be seen below, 18,286 features is a lot; we should find a way to reduce the number of features.
# But first, let's see how our model performs.

Number of features: 18286
Every 2000th feature:
['00', 'becaus', 'cosmopolitan', 'extinct', 'hurst', 'make', 'pauli', 'runners', 'tamara', 'write']
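One common way to shrink the vocabulary is the `min_df` parameter of `CountVectorizer`, which drops terms that appear in fewer than `min_df` documents. A small sketch on a made-up four-document corpus (the documents and the threshold of 2 are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus, only to show the effect of min_df.
docs = [
    "fire in the city",
    "fire near the forest",
    "flood in the city",
    "love this movie",
]

full = CountVectorizer().fit(docs)
pruned = CountVectorizer(min_df=2).fit(docs)  # keep terms seen in >= 2 documents

print("all terms:   ", len(full.vocabulary_))
print("min_df=2:    ", len(pruned.vocabulary_))
print("kept terms:  ", sorted(pruned.vocabulary_))
```

Rare terms (typos, usernames, one-off hashtags) usually make up most of a tweet vocabulary, so even a small threshold tends to remove a large share of the features.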


In [25]:
# Let's check LogisticRegression on this data set
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

scores = cross_val_score(LogisticRegression(), X_train_vect, y_train, cv=5,scoring='roc_auc')

print("Mean cross-validation AUC: {:.2f}".format(np.mean(scores)))



Mean cross-validation AUC: 0.88
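In the cell above the vectorizer was fitted on the whole training set before cross-validation, so each validation fold has already contributed to the vocabulary. A hedged sketch of the alternative, where the vectorizer sits inside a pipeline and is refitted on each training fold (the toy texts and `max_iter=1000` are assumptions, not the notebook's settings):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic texts standing in for the tweets.
texts = [f"fire destroyed building number {i}" for i in range(20)] + [
    f"love this movie number {i}" for i in range(20)
]
labels = np.array([1] * 20 + [0] * 20)

# The vocabulary is learned only from the training part of each fold.
pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, texts, labels, cv=5, scoring="roc_auc")

print("AUC per fold:", scores)
```

For a plain bag-of-words model the difference is often small, but it matters more once `min_df` or tf-idf weights depend on document counts.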


In [141]:
# Let's decrease the number of features with lemmatization:
#      Do not run this code
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WhitespaceTokenizer

w_tokenizer = WhitespaceTokenizer()
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df['text_lemmatized'] = df['text'].apply(lemmatize_text)
#vect = CountVectorizer().fit(text_train)
#X_train = vect.transform(text_train)
df['text_lemmatized'].iloc[:20]

#print ("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(word, pos="v")))
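To see why lemmatization reduces the number of features, the sketch below plugs a normalizing tokenizer into `CountVectorizer`. The lookup table `TOY_LEMMAS` and the function `toy_lemmatize_tokens` are hypothetical stand-ins for `WordNetLemmatizer`, used here only so the example runs without the WordNet corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical lemma table; a real run would call WordNetLemmatizer instead.
TOY_LEMMAS = {
    "cars": "car",
    "sets": "set",
    "houses": "house",
    "burned": "burn",
    "burning": "burn",
}

def toy_lemmatize_tokens(text):
    """Split on whitespace and map each token to its (toy) lemma."""
    return [TOY_LEMMAS.get(tok, tok) for tok in text.split()]

docs = ["cars set ablaze", "car sets fire", "houses burned", "house burning"]

# token_pattern=None because a custom tokenizer replaces the default regex.
raw = CountVectorizer(tokenizer=str.split, token_pattern=None).fit(docs)
lem = CountVectorizer(tokenizer=toy_lemmatize_tokens, token_pattern=None).fit(docs)

print("features without lemmatization:", len(raw.vocabulary_))
print("features with lemmatization:   ", len(lem.vocabulary_))
```

Inflected forms such as "cars"/"car" collapse into a single column, which is the effect the lemmatization cell is after.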



In [46]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score, f1_score 
from sklearn.linear_model import LogisticRegression

vect = CountVectorizer(min_df=5).fit(text_train)
print('The number of features in CountVectorizer: ',len(vect.get_feature_names()))

X_train_vectorized = vect.transform(text_train)
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(text_test))

# Note: the AUC below is computed from hard class predictions, not predicted probabilities.
print('CountVectorizer AUC: ', roc_auc_score(y_test, predictions))
print('CountVectorizer f1: ', f1_score(y_test, predictions))

The number of features in CountVectorizer:  3171
CountVectorizer AUC:  0.767954623352087
CountVectorizer f1:  0.6625766871165644


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [47]:
# Let's use tf-idf
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score 

# Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 5
# X_train=df['text_lemmatized']
vect = TfidfVectorizer(min_df=5).fit(text_train)
print('The number of features in tf-idf: ',len(vect.get_feature_names()))

X_train_vectorized = vect.transform(text_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(text_test))

print('tf-idf AUC: ', roc_auc_score(y_test, predictions))
print('tf-idf f1: ', f1_score(y_test, predictions))

The number of features in tf-idf:  3171
tf-idf AUC:  0.6839081580807531
tf-idf f1:  0.527536231884058
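tf-idf keeps the same vocabulary as the count model (3171 features in both cases) and only changes the weights: terms that occur in many documents get a lower inverse document frequency. A minimal sketch on a made-up corpus, using scikit-learn's default smoothed idf:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "fire" is in every document, the other terms in one each.
docs = ["fire fire city", "fire forest", "fire flood"]

tfidf = TfidfVectorizer().fit(docs)

# Map each term to its learned idf weight.
idf = {term: tfidf.idf_[col] for term, col in tfidf.vocabulary_.items()}

for term in sorted(idf):
    print(f"{term:8s} idf = {idf[term]:.3f}")
```

A term present in every document gets the minimum idf of 1.0, so keyword-like words that appear in nearly all tweets of a class carry less weight than under raw counts.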


In [48]:
# n-gram counts
# Let's see whether combinations of words can improve the performance of the model.
# Fit the CountVectorizer to the training data, specifying a minimum
# document frequency of 5 and extracting 1-grams and 2-grams
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(text_train)

X_train_vectorized = vect.transform(text_train)
print('The number of features in CountVectorizer: ',len(vect.get_feature_names()))
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(text_test))
print('CountVectorizer AUC: ', roc_auc_score(y_test, predictions))
print('CountVectorizer f1: ', f1_score(y_test, predictions))


The number of features in CountVectorizer:  5470
CountVectorizer AUC:  0.7565863110249129
CountVectorizer f1:  0.6427688504326329


In [50]:
# n-gram tf-idf
# Let's see whether combinations of words can improve the performance of the model.
# Fit the TfidfVectorizer to the training data, specifying a minimum
# document frequency of 5 and extracting 1-grams and 2-grams
vect = TfidfVectorizer(min_df=5, ngram_range=(1,2)).fit(text_train)

X_train_vectorized = vect.transform(text_train)
print('The number of features in TfidfVectorizer: ',len(vect.get_feature_names()))
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(text_test))
print('tf-idf AUC: ', roc_auc_score(y_test, predictions))
print('tf-idf f1: ', f1_score(y_test, predictions))


The number of features in TfidfVectorizer:  5470
tf-idf AUC:  0.6813254640350217
tf-idf f1:  0.526002971768202
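The four experiments above repeat the same fit, predict and score steps with a different vectorizer each time. A hedged sketch of a small helper that runs them in a loop; the function name `evaluate`, the toy texts and `max_iter=1000` are assumptions for illustration (and `min_df` is left at its default because the toy corpus is tiny):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def evaluate(vect, train_texts, train_y, test_texts, test_y):
    """Fit the vectorizer and a logistic regression, return the test F1."""
    model = LogisticRegression(max_iter=1000)
    model.fit(vect.fit_transform(train_texts), train_y)
    return f1_score(test_y, model.predict(vect.transform(test_texts)))

# Made-up texts standing in for text_train / text_test.
train_texts = [
    "forest fire spreads fast", "flood destroys bridge",
    "fire burns house", "flood hits city",
    "love this movie", "great song today",
    "this song is fire", "movie night today",
]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["fire destroys house", "flood hits bridge",
              "love this song", "great movie today"]
test_y = [1, 1, 0, 0]

candidates = {
    "count 1-gram": CountVectorizer(),
    "tf-idf 1-gram": TfidfVectorizer(),
    "count 1-2-gram": CountVectorizer(ngram_range=(1, 2)),
    "tf-idf 1-2-gram": TfidfVectorizer(ngram_range=(1, 2)),
}

results = {name: evaluate(v, train_texts, train_y, test_texts, test_y)
           for name, v in candidates.items()}

for name, score in results.items():
    print(f"{name:16s} f1 = {score:.3f}")
```

Keeping the comparison in one place also avoids mislabeled print statements when a cell is copied and edited.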


In [140]:
# Let's try to improve the performance by tuning hyperparameters
from sklearn.model_selection import GridSearchCV
pipe=Pipeline([('clf',LogisticRegression())])
param_grid = {'clf__C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('clf',
                                        LogisticRegression(C=1.0,
                                                           class_weight=None,
                                                           dual=False,
                                                           fit_intercept=True,
                                                           intercept_scaling=1,
                                                           l1_ratio=None,
                                                           max_iter=100,
                                                           multi_class='auto',
                                                           n_jobs=None,
                                                           penalty='l2',
                                                           random_state=None,
                                                 