https://www.kaggle.com/c/nlp-getting-started/overview

Welcome to one of our Getting Started machine learning competitions.
This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. The competition dataset is not too big, and even if you don’t have much personal computing power, you can do all of the work in our free, no-setup, Jupyter Notebooks environment called Kaggle Notebooks.

Competition Description
Twitter has become an important communication channel in times of emergency.
The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programatically monitoring Twitter (i.e. disaster relief organizations and news agencies).

But, it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:


![i](https://storage.googleapis.com/kaggle-media/competitions/tweet_screenshot.png)


The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which one’s aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

Acknowledgments
This dataset was created by the company figure-eight and originally shared on their ‘Data For Everyone’ website here.

Tweet source: https://twitter.com/AnyOtherAnnaK/status/629195955506708480

Evaluation: https://www.kaggle.com/c/nlp-getting-started/overview/evaluation

Submissions are evaluated using F1 between the predicted and expected answers.

F1 is calculated as follows:
F1 = (2 ∗ precision ∗ recall) / (precision + recall)
where:

precision = TP / (TP + FP)

recall = TP / (TP + FN)

and:

True Positive [TP] = your prediction is 1, and the ground truth is also 1 - you predicted a positive and that's true!

False Positive [FP] = your prediction is 1, and the ground truth is 0 - you predicted a positive, and that's false.

False Negative [FN] = your prediction is 0, and the ground truth is 1 - you predicted a negative, and that's false.


In [1]:
'''
!pip install spacy
!python -m spacy download en
'''

import numpy as np
import pandas as pd
import re
import spacy

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score, make_scorer

spacy.load('en_core_web_sm')
from spacy.lang.en import English
nlp = English()
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

pd.options.display.max_rows = None
pd.options.display.width = None
pd.options.display.max_colwidth = -1

In [2]:
train = pd.read_csv("train.csv")
train.head(5)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected,1
3,6,,,"13,000 people receive #wildfires evacuation orders in California",1
4,7,,,Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school,1


In [3]:
holdout = pd.read_csv("test.csv")
holdout.head(5)

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, stay safe everyone."
2,3,,,"there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all"
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
id          7613 non-null int64
keyword     7552 non-null object
location    5080 non-null object
text        7613 non-null object
target      7613 non-null int64
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [5]:
train.describe()

Unnamed: 0,id,target
count,7613.0,7613.0
mean,5441.934848,0.42966
std,3137.11609,0.49506
min,1.0,0.0
25%,2734.0,0.0
50%,5408.0,0.0
75%,8146.0,1.0
max,10873.0,1.0


In [6]:
#Verify missing
print(train.isnull().sum())
print()
print(train.isnull().sum() / len(train))

id          0   
keyword     61  
location    2533
text        0   
target      0   
dtype: int64

id          0.000000
keyword     0.008013
location    0.332720
text        0.000000
target      0.000000
dtype: float64


In [7]:
train['keyword'].value_counts(dropna = False)

NaN                      61
fatalities               45
deluge                   42
armageddon               42
sinking                  41
harm                     41
body%20bags              41
damage                   41
evacuate                 40
collided                 40
twister                  40
windstorm                40
siren                    40
outbreak                 40
fear                     40
wreckage                 39
earthquake               39
hellfire                 39
weapon                   39
explosion                39
collision                39
wrecked                  39
famine                   39
derailment               39
weapons                  39
flames                   39
whirlwind                39
sunk                     39
sinkhole                 39
oil%20spill              38
injury                   38
flooding                 38
blaze                    38
drowned                  38
deaths                   38
hurricane           

In [8]:
train['location'].value_counts(dropna = False)

NaN                                                  2533
USA                                                  104 
New York                                             71  
United States                                        50  
London                                               45  
Canada                                               29  
Nigeria                                              28  
UK                                                   27  
Los Angeles, CA                                      26  
India                                                24  
Mumbai                                               22  
Washington, DC                                       21  
Kenya                                                20  
Worldwide                                            19  
Chicago, IL                                          18  
Australia                                            18  
California                                           17  
New York, NY  

'location' is not a godd variable.
1/3 is null. Moreover there ar a lot of, '?', '??', '???' ...

If we decide to use this variable, it will be necessary to handle abreviations first.

Unfortunately, there are wrong labels in the train dataset.
We will not do anything about it. All dataset have some errors and we hope to do a consistent model that will not be significantly affected by those errors.

In [9]:
#Exemples of wrong labels
wrong_labels = [328,443,513,2619,3640,3900,4342,5781,6552,6554,6570,6701,6702,6729,6861,7226]
train[train['id'].isin(wrong_labels)]

Unnamed: 0,id,keyword,location,text,target
229,328,annihilated,,Ready to get annihilated for the BUCS game,1
301,443,apocalypse,,Short Reading\n\nApocalypse 21:1023 \n\nIn the spirit the angel took me to the top of an enormous high mountain and... http://t.co/v8AfTD9zeZ,1
356,513,army,Studio,But if you build an army of 100 dogs and their leader is a lion all dogs will fight like a lion.,1
1822,2619,crashed,,My iPod crashed..... \n#WeLoveYouLouis \n#MTVHottest One Direction,1
2536,3640,desolation,"Quilmes , Arg",This desperation dislocation\nSeparation condemnation\nRevelation in temptation\nIsolation desolation\nLet it go and so to find away,1
2715,3900,devastated,PG Chillin!,Man Currensy really be talkin that talk... I'd be more devastated if he had a ghostwriter than anybody else....,1
3024,4342,dust%20storm,chicago,Going to a fest? Bring swimming goggles for the dust storm in the circle pit,1
4068,5781,forest%20fires,,Campsite recommendations \nToilets /shower \nPub \nFires \nNo kids \nPizza shop \nForest \nPretty stream \nNo midges\nNo snakes\nThanks ??,1
4609,6552,injury,Saint Paul,My prediction for the Vikings game this Sunday....dont expect a whole lot. Infact I think Zimmer goal is....injury free 1st game,1
4611,6554,injury,,Dante Exum's knee injury could stem Jazz's hoped-for surge back to ... http://t.co/8PIFutrB5U,1


In [10]:
#Verify duplicates
duplicated_text_boolean = train.duplicated(['text'], keep = False) 
duplicated_text = train[['id', 'text', 'target']][duplicated_text_boolean]
print(duplicated_text)

         id  \
40    59      
48    68      
106   156     
115   165     
118   171     
119   172     
147   211     
164   238     
610   881     
624   898     
630   907     
634   916     
1134  1634    
1156  1665    
1172  1689    
1197  1723    
1199  1725    
1201  1727    
1202  1728    
1204  1733    
1213  1750    
1214  1752    
1221  1760    
1222  1761    
1242  1790    
1251  1807    
1331  1922    
1332  1924    
1335  1929    
1343  1941    
1345  1943    
1349  1950    
1356  1957    
1360  1962    
1365  1968    
1623  2346    
1703  2458    
1704  2459    
1725  2488    
1771  2544    
2345  3373    
2352  3387    
2439  3503    
2441  3505    
2449  3517    
2450  3518    
2452  3520    
2454  3522    
2456  3524    
2477  3552    
2646  3798    
2651  3806    
2655  3814    
2666  3828    
2674  3836    
2679  3841    
2719  3905    
2736  3933    
2799  4026    
2816  4049    
2822  4057    
2828  4064    
2830  4068    
2831  4072    
2832  4076    
2833  4077

In [11]:
dups_labels = [898, 907, 916]
train[train['id'].isin(dups_labels)]

Unnamed: 0,id,keyword,location,text,target
624,898,bioterrorism,,To fight bioterrorism sir.,0
630,907,bioterrorism,,To fight bioterrorism sir.,1
634,916,bioterrorism,,To fight bioterrorism sir.,0


The data base has not only duplicates but alse duplicated text with differents target value.
Maybe the difference in the target value is because a difference in the location. However, if this is the case we would need more data to fix the problem.
First we will remove all duplicates with different target value.
After we will remove the duplicated text, staying with only the first observations that appears.

In [12]:
#Remove different target values.
duplicated_text.sort_values(['text', 'target'], inplace = True)
duplicated_text_diff_taget_id = []

#Save one id of text with different target.
for i in range(1, len(duplicated_text)):
    if (duplicated_text['text'].iloc[i] == duplicated_text['text'].iloc[i - 1]) and (duplicated_text['target'].iloc[i] != duplicated_text['target'].iloc[i - 1]):
        duplicated_text_diff_taget_id.append(duplicated_text['id'].iloc[i])

duplicated_text_diff_taget = duplicated_text[duplicated_text['id'].isin(duplicated_text_diff_taget_id)]
duplicated_text_diff_taget_id = []
#Save all ids of text with different target.
for i in range(len(duplicated_text)):
    for j in range(len(duplicated_text_diff_taget)):
        if duplicated_text['text'].iloc[i] == duplicated_text_diff_taget['text'].iloc[j]:
            duplicated_text_diff_taget_id.append(duplicated_text['id'].iloc[i])

train = train[~train['id'].isin(duplicated_text_diff_taget_id)]


#Keep just the first duplicated text
train.drop_duplicates(['text'], inplace = True, keep = 'first') 
train.reset_index(inplace = True)

In [13]:
#Normalize the text
def normalize_text(text):
    text = text.lower()
    
    # Removing URLs. Most of the URL does not have any meaning. At least, only by their names it is impossible to know its significance.
    text = re.sub(r"http\S+", "", text)
    
    #We removed @ and # from the list below because @ may imply a location or person and words with # may have different meaning. 
    punctuations = '!?+&*[]-%.,:\/();$=><|{}^û_1234567890' + "'`"
    for p in punctuations:
        text = text.replace(p, '')
 
    #Remove stop words and tokenization
    text_hashtag = re.sub(r'#(\w+)',r'HashTagHashTag\1',text)  #Keep # words
    filtered_text=[]
    
    for word in nlp(text_hashtag):
        if (word.is_stop == False) and (word.text not in ('s', 'w', '#')) and (word.text.isspace() == False): #SpaCy has a bug with blank spaces.
            word = word.lemma_      #Lemmatization
            word = re.sub(r'HashTagHashTag','#',word) 
            filtered_text.append(word)

    return filtered_text

In [14]:
train["tokenized_clean"] = train["text"].apply(normalize_text)
train.head(5)

Unnamed: 0,index,id,keyword,location,text,target,tokenized_clean
0,0,1,,,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,1,"[deeds, reason, #earthquake, allah, forgive]"
1,1,4,,,Forest fire near La Ronge Sask. Canada,1,"[forest, fire, near, la, ronge, sask, canada]"
2,2,5,,,All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected,1,"[residents, asked, shelter, place, notified, officers, evacuation, shelter, place, orders, expected]"
3,3,6,,,"13,000 people receive #wildfires evacuation orders in California",1,"[people, receive, #wildfires, evacuation, orders, california]"
4,4,7,,,Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school,1,"[got, sent, photo, ruby, #alaska, smoke, #wildfires, pours, school]"


In [15]:
#Count words and remove any word that shows only one time
tokenized_text = train['tokenized_clean']
unique_tokens = []
single_tokens = []
for tokens in tokenized_text:
    for token in tokens:
        if token not in single_tokens:
            single_tokens.append(token)
        elif token in single_tokens and token not in unique_tokens:
            unique_tokens.append(token)

            
counts = pd.DataFrame(0, index=np.arange(len(tokenized_text)), columns=unique_tokens)

for index, e in enumerate(tokenized_text):
    for token in e:
        if token in unique_tokens:
            counts.iloc[index][token] += 1

counts.head(5)

Unnamed: 0,shelter,place,evacuation,orders,#wildfires,california,fire,be,people,south,...,ï#hannaph,headquarters,smells,@livingsafely,#ar,#nc,#ok,ssw,anza,glink
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,2,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,1,1,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
#Verify words frequency
word_counts = counts.sum(axis=0)
word_counts.value_counts().sort_index()

2      2181
3      1009
4      603 
5      433 
6      260 
7      190 
8      164 
9      137 
10     106 
11     79  
12     89  
13     73  
14     57  
15     48  
16     58  
17     41  
18     30  
19     41  
20     28  
21     29  
22     29  
23     25  
24     21  
25     20  
26     24  
27     20  
28     23  
29     22  
30     27  
31     28  
32     22  
33     20  
34     21  
35     25  
36     19  
37     17  
38     17  
39     9   
40     12  
41     11  
42     8   
43     10  
44     16  
45     4   
46     8   
47     5   
48     4   
49     8   
50     2   
51     4   
52     7   
53     2   
54     1   
55     3   
56     5   
57     2   
58     2   
59     1   
60     5   
61     2   
63     2   
64     2   
65     1   
66     2   
67     2   
68     2   
70     3   
71     1   
72     3   
73     3   
74     1   
75     1   
76     1   
80     1   
82     1   
83     3   
84     2   
85     3   
86     1   
87     1   
90     1   
92     2   
95     2   
96  

In [17]:
word_counts[word_counts > 50]

california      111
fire            244
be              312
people          187
to              54 
school          65 
car             90 
love            97 
have            72 
man             109
police          137
like            340
know            111
amp             298
time            121
not             500
got             123
accident        82 
near            52 
live            59 
pm              102
home            73 
crash           116
rt              95 
years           80 
damage          53 
life            84 
help            70 
coming          51 
getting         56 
right           68 
going           103
today           86 
great           60 
found           51 
shit            52 
killed          95 
dead            96 
@youtube        83 
emergency       151
new             225
#news           73 
need            72 
good            87 
year            67 
way             76 
work            74 
hit             52 
train           85 
day             113


In [18]:
word_counts[word_counts < 4]

tampa                             2
bago                              2
elizabeth                         2
turkmen                           2
wmv                               2
rene                              2
marker                            3
mooresville                       2
iredell                           2
#traffic                          3
piner                             2
rdhorndale                        2
fyi                               3
cadfyi                            2
@aftershockdelo                   2
scuf                              2
roller                            3
coaster                           2
joel                              3
coahuila                          2
#aircraft                         2
#airplane                         3
strict                            2
liability                         2
context                           3
wedn                              2
#rip                              2
ems                         

In [19]:
#Ignore any word with a counting less than 5 to prevent overfitting. 
counts = counts.loc[:,(word_counts >= 5)]

In [20]:
train_1 = train.copy()

train_1['index_original'] = train_1['index']
train_1 = train_1.drop('index', axis = 1)

train_1['target_label'] = train_1['target']
train_1 = train_1.drop('target', axis = 1) #This is necessary because there is the word target in text.

train_1['id_original'] = train_1['id']
train_1 = train_1.drop('id', axis = 1)

train_1['keyword_original'] = train_1['keyword']
train_1 = train_1.drop('keyword', axis = 1)

train_1['location_original'] = train_1['location']
train_1 = train_1.drop('location', axis = 1)

train_1['text_original'] = train_1['text']
train_1 = train_1.drop('text', axis = 1)

train_1 = pd.concat([train_1, counts], axis = 1)
train_1.head(5)

Unnamed: 0,tokenized_clean,index_original,target_label,id_original,keyword_original,location_original,text_original,shelter,place,evacuation,...,gunfire,richmond,exchanging,wounds,#kerricktrial,conclusively,wrecked,cramer,disneys,igers
0,"[deeds, reason, #earthquake, allah, forgive]",0,1,1,,,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"[forest, fire, near, la, ronge, sask, canada]",1,1,4,,,Forest fire near La Ronge Sask. Canada,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"[residents, asked, shelter, place, notified, officers, evacuation, shelter, place, orders, expected]",2,1,5,,,All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected,2,2,1,...,0,0,0,0,0,0,0,0,0,0
3,"[people, receive, #wildfires, evacuation, orders, california]",3,1,6,,,"13,000 people receive #wildfires evacuation orders in California",0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,"[got, sent, photo, ruby, #alaska, smoke, #wildfires, pours, school]",4,1,7,,,Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
train_1['target_label'].value_counts(dropna = False)

0    4297
1    3188
Name: target_label, dtype: int64

In [22]:
#First model (Naive Bayes)
vocabulary = counts.columns

p_false = train_1['target_label'].value_counts(normalize = True, dropna = False)[0]
p_real = train_1['target_label'].value_counts(normalize = True, dropna = False)[1]

# Isolating false and real desaster messages
false_messages = train_1[train_1['target_label'] == 0]   
real_messages = train_1[train_1['target_label'] == 1]

# N_false
n_words_false_message = false_messages['tokenized_clean'].apply(len)
n_false = n_words_false_message.sum()

# N_Ham
n_words_real_message = real_messages['tokenized_clean'].apply(len)
n_real = n_words_real_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

print(p_false)
print(p_real)
print(n_false)
print(n_real)
print(n_vocabulary)

0.5740814963259853
0.4259185036740147
34859
28804
2446


In [23]:
# Initiate parameters
parameters_false = {unique_word:0 for unique_word in vocabulary}
parameters_real = {unique_word:0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
    n_word_given_false = false_messages[word].sum()
    p_word_given_false = (n_word_given_false + alpha) / (n_false + alpha*n_vocabulary)
    parameters_false[word] = p_word_given_false
    
    n_word_given_real = real_messages[word].sum()
    p_word_given_real = (n_word_given_real + alpha) / (n_real + alpha*n_vocabulary)
    parameters_real[word] = p_word_given_real

In [24]:
def classify(message):

    p_false_given_message = p_false
    p_real_given_message = p_real

    for word in message:
        if word in parameters_false:
            p_false_given_message *= parameters_false[word]
            
        if word in parameters_real:
            p_real_given_message *= parameters_real[word]
            
    return p_real_given_message - p_false_given_message
    
train_1['dif_p_real_false'] = train_1['tokenized_clean'].apply(classify)
train_1['target_model'] = train_1.apply(lambda row: 1 if row['dif_p_real_false'] > 0 else 0, axis = 1)

train_1.head(5)

Unnamed: 0,tokenized_clean,index_original,target_label,id_original,keyword_original,location_original,text_original,shelter,place,evacuation,...,exchanging,wounds,#kerricktrial,conclusively,wrecked,cramer,disneys,igers,dif_p_real_false,target_model
0,"[deeds, reason, #earthquake, allah, forgive]",0,1,1,,,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,0,0,0,...,0,0,0,0,0,0,0,0,6.051816e-08,1
1,"[forest, fire, near, la, ronge, sask, canada]",1,1,4,,,Forest fire near La Ronge Sask. Canada,0,0,0,...,0,0,0,0,0,0,0,0,9.653049e-16,1
2,"[residents, asked, shelter, place, notified, officers, evacuation, shelter, place, orders, expected]",2,1,5,,,All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected,2,2,1,...,0,0,0,0,0,0,0,0,1.8402389999999998e-36,1
3,"[people, receive, #wildfires, evacuation, orders, california]",3,1,6,,,"13,000 people receive #wildfires evacuation orders in California",0,0,1,...,0,0,0,0,0,0,0,0,3.427481e-16,1
4,"[got, sent, photo, ruby, #alaska, smoke, #wildfires, pours, school]",4,1,7,,,Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school,0,0,0,...,0,0,0,0,0,0,0,0,-2.4580299999999998e-21,0


In [25]:
#evaluation
f1_score(train_1['target_label'], train_1['target_model'])

0.8038709677419354

In [26]:
#Add the variable "keyword" and model with logit and random forest.
train_2 = train_1.copy()

#Count words and remove any word that shows only one time
unique_tokens = []
single_tokens = []
for token in train_2['keyword_original']:
    if token not in single_tokens:
        single_tokens.append(token)
    elif token in single_tokens and token not in unique_tokens:
        unique_tokens.append(token)

            
counts_keywords = pd.DataFrame(0, index=np.arange(len(tokenized_text)), columns=unique_tokens)

for index, e in enumerate(tokenized_text):
    for token in e:
        if token in unique_tokens:
            counts_keywords.iloc[index][token] += 1
            
#Ignore any keyword with a counting less than 5 to prevent overfitting. 
keyword_counts = counts_keywords.sum(axis=0)
counts_keywords = counts_keywords.loc[:,(keyword_counts >= 5)]
vocabulary_keywords = counts_keywords.columns

counts_keywords = pd.concat([train_2[['keyword_original', 'target_label']], counts_keywords], axis = 1)



# Isolating false and real desaster messages
false_keyword = counts_keywords[counts_keywords['target_label'] == 0] 
real_keyword = counts_keywords[counts_keywords['target_label'] == 1]

# N_false
n_false_keyword = len(false_keyword['keyword_original'])

# N_Ham
n_real_keyword = len(real_keyword['keyword_original'])

# N_Vocabulary
n_vocabulary_keywords = len(vocabulary_keywords)

# Laplace smoothing
alpha = 1



# Initiate parameters
parameters_false_keyword = {unique_word:0 for unique_word in vocabulary_keywords}
parameters_real_keyword = {unique_word:0 for unique_word in vocabulary_keywords}

# Calculate parameters
for keyword in vocabulary_keywords:
    n_keyword_given_false = false_keyword[keyword].sum()
    p_keyword_given_false = (n_keyword_given_false + alpha) / (n_false_keyword + alpha*n_vocabulary_keywords)
    parameters_false_keyword[keyword] = p_keyword_given_false
    
    n_keyword_given_real = real_keyword[keyword].sum()
    p_keyword_given_real = (n_keyword_given_real + alpha) / (n_real_keyword + alpha*n_vocabulary)
    parameters_real_keyword[keyword] = p_keyword_given_real

In [27]:
def classify_keyword(keyword):

    p_false_given_keyword = p_false
    p_real_given_keyword = p_real

    if keyword in parameters_false_keyword:
        p_false_given_keyword *= parameters_false_keyword[keyword]
            
    if keyword in parameters_real_keyword:
        p_real_given_keyword *= parameters_real_keyword[keyword]
            
    return p_real_given_keyword - p_false_given_keyword
    
train_2['dif_keyword_p_real_false'] = train_2['keyword_original'].apply(classify_keyword)
train_2['target_model_keyword'] = train_2.apply(lambda row: 1 if row['dif_keyword_p_real_false'] > 0 else 0, axis = 1)

train_2.head(5)

Unnamed: 0,tokenized_clean,index_original,target_label,id_original,keyword_original,location_original,text_original,shelter,place,evacuation,...,#kerricktrial,conclusively,wrecked,cramer,disneys,igers,dif_p_real_false,target_model,dif_keyword_p_real_false,target_model_keyword
0,"[deeds, reason, #earthquake, allah, forgive]",0,1,1,,,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,0,0,0,...,0,0,0,0,0,0,6.051816e-08,1,-0.148163,0
1,"[forest, fire, near, la, ronge, sask, canada]",1,1,4,,,Forest fire near La Ronge Sask. Canada,0,0,0,...,0,0,0,0,0,0,9.653049e-16,1,-0.148163,0
2,"[residents, asked, shelter, place, notified, officers, evacuation, shelter, place, orders, expected]",2,1,5,,,All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected,2,2,1,...,0,0,0,0,0,0,1.8402389999999998e-36,1,-0.148163,0
3,"[people, receive, #wildfires, evacuation, orders, california]",3,1,6,,,"13,000 people receive #wildfires evacuation orders in California",0,0,1,...,0,0,0,0,0,0,3.427481e-16,1,-0.148163,0
4,"[got, sent, photo, ruby, #alaska, smoke, #wildfires, pours, school]",4,1,7,,,Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school,0,0,0,...,0,0,0,0,0,0,-2.4580299999999998e-21,0,-0.148163,0


In [28]:
#evaluationclassify_keyword
f1_score(train_2['target_label'], train_2['target_model_keyword'])

0.5156985871271585

In [29]:
#Unfortunately, the variable keyword is not very helpfull.
#Nonetheless, lets continue with the model.

def select_model(df,features):
    
    all_X = df[features]
    all_y = df["target_label"]

    # List of dictionaries, each containing a model name,
    # it's estimator and a dict of hyperparameters
    models = [
        {
            "name": "LogisticRegression",
            "estimator": LogisticRegression(random_state = 0),
            "hyperparameters":
                {
                    "solver": ["newton-cg", "lbfgs", "liblinear"],
                    "fit_intercept": [True, False],
                    "class_weight":["balanced", None]
                }
        },
        {
            "name": "RandomForestClassifier",
            "estimator": RandomForestClassifier(n_estimators = 300, random_state=1),
            "hyperparameters":
                {
                    "criterion": ["entropy", "gini"],
                    "max_depth": [2, 3, 4, 5],
                    "max_features": ["log2", "sqrt"]
                }
        }
    ]

    for model in models:
        print(model['name'])
        print('-'*len(model['name']))

        grid = GridSearchCV(model["estimator"],
                            param_grid=model["hyperparameters"],
                            cv=10,
                            scoring = make_scorer(f1_score))
        grid.fit(all_X,all_y)
        model["best_params"] = grid.best_params_
        model["best_score"] = grid.best_score_
        model["best_model"] = grid.best_estimator_

        print("Best Score: {}".format(model["best_score"]))
        print("Best Parameters: {}\n".format(model["best_params"]))

    return models

columns_models = ['target_model_keyword', 'target_model']
result = select_model(train_2, columns_models)

LogisticRegression
------------------
Best Score: 0.8037160366636067
Best Parameters: {'class_weight': 'balanced', 'fit_intercept': True, 'solver': 'newton-cg'}

RandomForestClassifier
----------------------
Best Score: 0.8037160366636067
Best Parameters: {'criterion': 'entropy', 'max_depth': 2, 'max_features': 'log2'}



In [30]:
#Use continuous variables to model
columns_models = ['dif_p_real_false', 'dif_keyword_p_real_false']
result = select_model(train_2, columns_models)

LogisticRegression
------------------
Best Score: 0.42310589687588074
Best Parameters: {'class_weight': 'balanced', 'fit_intercept': False, 'solver': 'newton-cg'}

RandomForestClassifier
----------------------
Best Score: 0.6596675475688312
Best Parameters: {'criterion': 'gini', 'max_depth': 5, 'max_features': 'log2'}



It seens that the best model is the first naive Bayes.

In [31]:
#Clean and tokennize the text from the holdout
holdout["tokenized_clean"] = holdout["text"].apply(normalize_text)

#score the holdout
holdout['dif_p_real_false'] = holdout['tokenized_clean'].apply(classify)
holdout['target'] = holdout.apply(lambda row: 1 if row['dif_p_real_false'] > 0 else 0, axis = 1)
holdout_ids = holdout[['id', 'target']]

holdout_ids.to_csv("submission.csv",index=False)
holdout_ids.head()

#Kaggle score: 0.77811

Unnamed: 0,id,target
0,0,1
1,2,1
2,3,1
3,9,1
4,11,1
