<h1>Natural Language Processing with Disaster Tweets Competition</h1>

<h2>Import Dependencies</h2>

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing, svm

<h2>Import Data</h2>

In [2]:
train_df = pd.read_csv("./data/train.csv")
test_df = pd.read_csv("./data/test.csv")

In [3]:
train_df.head(), train_df.isna().sum(), train_df['keyword'].unique()

(   id keyword location                                               text  \
 0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
 1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
 2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
 3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
 4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   
 
    target  
 0       1  
 1       1  
 2       1  
 3       1  
 4       1  ,
 id             0
 keyword       61
 location    2533
 text           0
 target         0
 dtype: int64,
 array([nan, 'ablaze', 'accident', 'aftershock', 'airplane%20accident',
        'ambulance', 'annihilated', 'annihilation', 'apocalypse',
        'armageddon', 'army', 'arson', 'arsonist', 'attack', 'attacked',
        'avalanche', 'battle', 'bioterror', 'bioterrorism', 'blaze',
        'blazing', 'bleeding', 'blew%20up', 'blight', 'blizzard', 

<ul>
<li>target = 0 -> Non-disaster tweet</li>
<li>target = 1 -> Disaster tweet</li>
</ul>

In [4]:
# Examples of non-disaster tweets
train_df[train_df["target"] == 0]["text"].values[:5]

array(["What's up man?", 'I love fruits', 'Summer is lovely',
       'My car is so fast', 'What a goooooooaaaaaal!!!!!!'], dtype=object)

In [5]:
# Examples of disaster tweets
train_df[train_df["target"] == 1]["text"].values[:5]

array(['Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all',
       'Forest fire near La Ronge Sask. Canada',
       "All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected",
       '13,000 people receive #wildfires evacuation orders in California ',
       'Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school '],
      dtype=object)

<p>One way to judge whether a tweet is about a disaster is to look at the words it contains.</p>

<h2>Building Vectors</h2>

<p>Let's turn each tweet into a vector of word counts.</p>
<p>We will use CountVectorizer from scikit-learn.</p>

In [6]:
count_vectorizer = feature_extraction.text.CountVectorizer()

In [7]:
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

In [8]:
# Get all the unique words in the first 5 tweets
count_vectorizer.get_feature_names_out()

array(['000', '13', 'alaska', 'all', 'allah', 'are', 'as', 'asked',
       'being', 'by', 'california', 'canada', 'deeds', 'earthquake',
       'evacuation', 'expected', 'fire', 'forest', 'forgive', 'from',
       'got', 'in', 'into', 'just', 'la', 'may', 'near', 'no', 'notified',
       'of', 'officers', 'or', 'orders', 'other', 'our', 'people',
       'photo', 'place', 'pours', 'reason', 'receive', 'residents',
       'ronge', 'ruby', 'sask', 'school', 'sent', 'shelter', 'smoke',
       'the', 'this', 'to', 'us', 'wildfires'], dtype=object)

In [9]:
train_df["text"][0]

'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'

In [10]:
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]


<p>This vector is the first tweet expressed over the vocabulary built from the first 5 tweets: each of the 54 columns is one word, and the value is how many times that word occurs in the first tweet.</p>
<p>A 0 means that the corresponding word does not appear in the tweet.</p>

In [11]:
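# Fit the vocabulary on all the training tweets and vectorize them
train_vectors = count_vectorizer.fit_transform(train_df["text"])
# For the test set we only call transform, so the test vectors use the
# vocabulary learned from the training tweets (same columns, unseen words ignored)
test_vectors = count_vectorizer.transform(test_df["text"])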

In [12]:
count_vectorizer.get_feature_names_out()

array(['00', '000', '0000', ..., 'ûónegligence', 'ûótech', 'ûówe'],
      dtype=object)

<p>We assume that the presence of a word, or set of words, tells us whether a tweet is disaster-related. As a first approximation we treat this relationship as linear, so let's build a linear model.</p>

In [37]:
clf = linear_model.RidgeClassifier(alpha=30)

In [38]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.61894025, 0.59903897, 0.67908903])
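<p>The scores above are F1 values, one per cross-validation fold. As a reminder of what that metric measures, here is a small sketch on made-up labels; <code>y_true</code> and <code>y_pred</code> are illustrative and unrelated to the competition data.</p>

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical ground truth and predictions for six tweets.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

precision = precision_score(y_true, y_pred)  # true positives / predicted positives
recall = recall_score(y_true, y_pred)        # true positives / actual positives
f1 = f1_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall.
manual_f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1, manual_f1)
```

<p>Because it combines precision and recall, F1 penalizes a model that labels everything as a disaster as well as one that almost never does.</p>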

In [39]:
clf.fit(train_vectors, train_df["target"])

In [40]:
preds = clf.predict(test_vectors)

In [24]:
test_df['id']

0           0
1           2
2           3
3           9
4          11
        ...  
3258    10861
3259    10865
3260    10868
3261    10874
3262    10875
Name: id, Length: 3263, dtype: int64

In [25]:
test_vectors.shape

(3263, 21637)

In [26]:
preds.shape

(3263,)

In [41]:
out = [test_df['id'], preds]

In [42]:
# Build the submission table: one row per test id with its predicted target
predict = pd.DataFrame({'id': test_df['id'], 'target': preds})
predict.to_csv("./predictions/predictions_4.csv", index=False)
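<p>Before uploading, it can help to read the file back and check its shape. The sketch below does this with an in-memory buffer instead of a file on disk; the five ids are the first ones shown for <code>test_df['id']</code>, while the predictions and the names <code>submission</code> and <code>reloaded</code> are made up for illustration.</p>

```python
import io

import numpy as np
import pandas as pd

# Stand-ins for test_df["id"] and preds; the prediction values are hypothetical.
ids = pd.Series([0, 2, 3, 9, 11], name="id")
preds = np.array([1, 0, 1, 1, 0])

submission = pd.DataFrame({"id": ids, "target": preds})

# Write to an in-memory buffer, then read it back as the grader would see it.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Basic sanity checks: expected columns, one row per id, binary targets.
assert list(reloaded.columns) == ["id", "target"]
assert len(reloaded) == len(ids)
assert reloaded["id"].is_unique
assert set(reloaded["target"].unique()) <= {0, 1}
print(reloaded.head())
```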