# Natural Language Processing with Disaster Tweets
The goal of this challenge is to predict which tweets describe real disasters and which do not.
The difficulty is that tweets often use disaster-related words metaphorically, so the model must be trained to recognize the cases where a disaster actually occurred.

train_data: (7613 rows)
| id | keyword | location | text | target |
The feature columns are the tweet text, keyword and location. Be careful: location and keyword may be empty!

The target column is the label for the ML model (1 denotes a disaster, 0 denotes no disaster).

test_data: (3263 rows)
The same columns, without the label column.

In [99]:
# Importing required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import naive_bayes
from sklearn import metrics

[nltk_data] Downloading package punkt to /home/sebastian/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [100]:
# Loading train - test data
train_data = pd.read_csv('resources/challenge_05/train.csv')
test_data = pd.read_csv('resources/challenge_05/test.csv')

A first look at the data

In [101]:
print(train_data.info())
train_data.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
None


Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


In [102]:
print(test_data.info())
test_data.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB
None


Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan
5,12,,,We're shaking...It's an earthquake
6,21,,,They'd probably still show more life than Arse...
7,22,,,Hey! How are you?
8,27,,,What a nice hat?
9,29,,,Fuck off!


In [103]:
# A first assessment shows missing values in the keyword and location columns, so some strategy is needed to handle the missing data in those columns.
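
One possible strategy, sketched here on a small hypothetical frame rather than the competition data, is to count the missing values and then fill `keyword` and `location` with a placeholder string. The placeholder `"unknown"` and the toy rows are assumptions made only for illustration; the rest of this notebook uses the `text` column alone.

```python
import pandas as pd

# Toy frame that mimics the train_data layout (hypothetical rows, not real tweets)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "keyword": [None, "flood", None],
    "location": [None, None, "Canada"],
    "text": [
        "Forest fire near the town",
        "Heavy rain causes flood",
        "What a nice day",
    ],
})

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Replace missing keyword/location values with a placeholder (assumed value: "unknown")
filled = toy.fillna({"keyword": "unknown", "location": "unknown"})
print(filled)
```

Whether a placeholder, dropping the columns, or another approach works best would need to be checked against validation accuracy.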

# Preprocessing datasets

In [104]:
# Now we split train_data into training and validation sets to build the model and evaluate its accuracy.
label = train_data.target
train_data = train_data.drop("target", axis=1)
x_train, x_test, y_train, y_test = train_test_split(train_data, label, test_size = 0.30, random_state = 42, stratify=label)

Using a tokenizer
A tokenizer splits a large body of text into smaller units, such as sentences or words.

In [105]:
# Example with the first tweet: the text is split into its words and symbols
example = train_data.loc[0, "text"]
example_split = word_tokenize(example)
print(example_split)

['Our', 'Deeds', 'are', 'the', 'Reason', 'of', 'this', '#', 'earthquake', 'May', 'ALLAH', 'Forgive', 'us', 'all']


# Next step: try the method recommended by the example guide, TfidfVectorizer
For this purpose I will use examples from the web:
https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a
https://www.etutorialspoint.com/index.php/386-tf-idf-tfidfvectorizer-tutorial-with-examples



In [106]:
# To make this easier to understand, consider the following common example:

# list of text documents - CORPUS
text = ["The cycle is ridden on the track.",
	"The bus is driven on the road.",
	"He is driving the bus."]

# create the transform object
vectorizer = TfidfVectorizer()

# tokenize and build vocabulary
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

{'the': 9, 'cycle': 1, 'is': 5, 'ridden': 7, 'on': 6, 'track': 10, 'bus': 0, 'driven': 2, 'road': 8, 'he': 4, 'driving': 3}
[1.28768207 1.69314718 1.69314718 1.69314718 1.69314718 1.
 1.28768207 1.69314718 1.69314718 1.         1.69314718]


So far, I understand that:
 - the first step splits all documents into words and builds the vocabulary (tokenization)
 - the second step assigns each word an index in alphabetical order
 - the third step creates the idf vector, where each value reflects how rare the corresponding vocabulary word is across the corpus. For example, the sixth component (index 5), which corresponds to
   "is", has the lowest value (1.0) because the word appears in every document (it behaves like a stop word)
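
The three steps above can be reproduced by hand. The sketch below uses only the standard library and assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, which to my understanding is what `TfidfVectorizer` applies with its default settings. The `tokenize` helper is a simplified stand-in for the vectorizer's default tokenizer, not the library's own code.

```python
import math
import re

corpus = [
    "The cycle is ridden on the track.",
    "The bus is driven on the road.",
    "He is driving the bus.",
]

def tokenize(doc):
    # Lowercase the text and keep words of two or more characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# Steps 1 and 2: collect the unique terms and sort them alphabetically;
# the position in the sorted list is the vocabulary index
vocabulary = sorted({term for doc in corpus for term in tokenize(doc)})

# Step 3: smoothed inverse document frequency for every term
n_docs = len(corpus)
idf = {}
for term in vocabulary:
    # Document frequency: number of documents that contain the term
    df = sum(term in tokenize(doc) for doc in corpus)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

for index, term in enumerate(vocabulary):
    print(index, term, round(idf[term], 8))
```

Terms that occur in all three documents ("is", "the") get the minimum value of 1.0, while terms that occur in a single document get the largest value.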

In [107]:
# Now we create a TfidfVectorizer object and fit it on the training split
tfidf = TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None) # same tokenizer as in the example above
tfidf.fit(x_train.text)

# transform the training and validation tweets
x_train = tfidf.transform(x_train.text)
y_train = np.array(y_train)

x_test = tfidf.transform(x_test.text)
y_test = np.array(y_test)

x_train

<5329x18111 sparse matrix of type '<class 'numpy.float64'>'
	with 90337 stored elements in Compressed Sparse Row format>

# Now the training and validation data are in the correct format, so we can train an ML model and then predict on the test dataset.

In [108]:
# We train a Naive Bayes model on the dataset and use it to predict labels for new tweets.
model = naive_bayes.MultinomialNB()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

In [109]:
# Now we can check the model accuracy using the predictions on the validation subset
accuracy = metrics.accuracy_score(y_test, y_pred)
accuracy

0.8051663747810858
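
Accuracy alone does not show how the errors are distributed between the two classes. As a rough illustration, the sketch below computes accuracy, precision, recall and F1 by hand on a handful of toy labels; the label lists are hypothetical and are not taken from the split above.

```python
# Toy labels (hypothetical, not taken from the notebook's validation split)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion-matrix counts, with 1 (disaster) as the positive class
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_hat))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # share of predicted disasters that are real
recall = tp / (tp + fn)     # share of real disasters that were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

With scikit-learn, `metrics.confusion_matrix` and `metrics.f1_score` give the same quantities for `y_test` and `y_pred`.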

# Finally, I can transform the test_data features (currently only the text column is used) into the correct numerical format with the fitted tfidf and predict the label column


In [110]:
test_data_transform = tfidf.transform(test_data.text)
y_pred_test = model.predict(test_data_transform)

In [111]:
submission = pd.DataFrame({
    'id' : test_data.id,
    'target' : y_pred_test
})
submission.to_csv("submission.csv", index=False)

My score on Kaggle was 0.78394!