# Solution

## Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from sklearn.svm import SVC
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.neighbors import KNeighborsClassifier

import os

## Data analysis

Let's see what the data looks like:

In [2]:
DATASET_DIR = "./dataset"

In [3]:
train = pd.read_csv(os.path.join(DATASET_DIR, "train.csv"))
test = pd.read_csv(os.path.join(DATASET_DIR, "test.csv"))

In [5]:
train.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [6]:
train.describe()

| | id | target |
|---|---|---|
| count | 7613.0 | 7613.0 |
| mean | 5441.934848 | 0.42966 |
| std | 3137.11609 | 0.49506 |
| min | 1.0 | 0.0 |
| 25% | 2734.0 | 0.0 |
| 50% | 5408.0 | 0.0 |
| 75% | 8146.0 | 1.0 |
| max | 10873.0 | 1.0 |


In [7]:
test.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... |
| 3 | 9 | | | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | | | Typhoon Soudelor kills 28 in China and Taiwan |


In [8]:
train.isna().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

In [9]:
test.isna().sum()

id             0
keyword       26
location    1105
text           0
dtype: int64

In [10]:
len(train)

7613

In [11]:
len(test)

3263
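
The missing-value counts above are easier to compare as fractions of each split. Here is a minimal sketch that recomputes them from the numbers printed above (the dictionary names are introduced only for illustration; on the DataFrames themselves, `train.isna().mean()` would give the same fractions):

```python
# Row counts and missing-value counts copied from the outputs above.
n_rows = {"train": 7613, "test": 3263}
missing = {
    "train": {"keyword": 61, "location": 2533},
    "test": {"keyword": 26, "location": 1105},
}

# Fraction of rows with a missing value, per split and column.
missing_fraction = {
    split: {col: count / n_rows[split] for col, count in cols.items()}
    for split, cols in missing.items()
}

for split, cols in missing_fraction.items():
    for col, frac in cols.items():
        print(f"{split:5s} {col:8s} {frac:.1%}")
```

Under 1% of the keyword values are missing in both splits, while roughly a third of the location values are missing, so location is the less complete column.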

## Training a model

Here I will be following the Kaggle quickstart notebook associated with this challenge (https://www.kaggle.com/philculliton/nlp-getting-started-tutorial):

In [4]:
count_vectorizer = feature_extraction.text.CountVectorizer()

In [5]:
## let's get counts for the first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train["text"][0:5])

In [19]:
## we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]


In [5]:
train_vectors = count_vectorizer.fit_transform(train["text"])

## note that we're NOT using .fit_transform() here. Using just .transform() makes sure
# that the tokens in the train vectors are the only ones mapped to the test vectors - 
# i.e. that the train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(test["text"])
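
To make the difference between `.fit_transform()` and `.transform()` concrete, here is a small pure-Python sketch of a count vectorizer. It is not scikit-learn's implementation: `TinyCountVectorizer` and its whitespace tokenizer are hypothetical simplifications used only for illustration.

```python
from collections import Counter


class TinyCountVectorizer:
    """Toy bag-of-words vectorizer (illustrative; not scikit-learn's code)."""

    def fit(self, texts):
        # Learn the vocabulary: one column index per distinct lowercased token.
        tokens = sorted({tok for text in texts for tok in text.lower().split()})
        self.vocabulary_ = {tok: i for i, tok in enumerate(tokens)}
        return self

    def transform(self, texts):
        # Count tokens using the learned vocabulary; unseen tokens are dropped.
        rows = []
        for text in texts:
            counts = Counter(text.lower().split())
            row = [0] * len(self.vocabulary_)
            for tok, n in counts.items():
                if tok in self.vocabulary_:
                    row[self.vocabulary_[tok]] = n
            rows.append(row)
        return rows

    def fit_transform(self, texts):
        return self.fit(texts).transform(texts)


vec = TinyCountVectorizer()
toy_train = vec.fit_transform(["forest fire near town", "fire fire everywhere"])
toy_test = vec.transform(["flood near town"])  # "flood" was never seen in training
```

Because the test row is built from the training vocabulary, it has the same number of columns as the training rows, and the unseen token is simply ignored. Refitting on the test set would produce a different column layout that a model trained on the original columns could not use.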

From the quickstart notebook:

"As we mentioned above, we think the words contained in each tweet are a good indicator of whether they're about a real disaster or not. The presence of particular word (or set of words) in a tweet might link directly to whether or not that tweet is real.

What we're assuming here is a linear connection. So let's build a linear model and see!"

## Linear model

In [21]:
## Our vectors are really big, so we want to push our model's weights
## toward 0 without completely discounting different words - ridge regression 
## is a good way to do this.
clf = linear_model.RidgeClassifier()

In [23]:
scores = model_selection.cross_val_score(clf, train_vectors, train["target"], cv=3, scoring="f1")
scores

array([0.59421842, 0.56455572, 0.64149093])

Onwards to submit the linear model:

In [28]:
clf.fit(train_vectors, train["target"])

RidgeClassifier()

In [29]:
sample_submission = pd.read_csv(os.path.join(DATASET_DIR, "sample_submission.csv"))

In [30]:
sample_submission["target"] = clf.predict(test_vectors)

In [31]:
sample_submission.head()

| | id | target |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 9 | 0 |
| 4 | 11 | 1 |


In [32]:
sample_submission.to_csv("submission.csv", index=False)

## SVM

OK, let me try an SVM (without hyperparameter search):

In [25]:
clf = SVC(gamma="auto")

In [27]:
scores = model_selection.cross_val_score(clf, train_vectors, train["target"], cv=3, scoring="f1")
scores

array([0., 0., 0.])

Hmm, this is strange: the SVM gets an F1 score of 0 on every fold. That's rather bad.
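
One possible explanation, not verified on this data, is that the SVM predicts class 0 for every tweet: F1 is then 0 because there are no true positives. The sketch below computes F1 by hand on made-up labels (`y_true` and `always_negative` are hypothetical, not taken from the dataset) to show the effect. It returns 0 when the score is undefined, which is an assumption of this sketch.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1), returning 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives, so precision and recall are 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels with roughly the same class balance as the training set.
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
always_negative = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
f1_always_negative = f1_score_binary(y_true, always_negative)
```

Such a classifier would still reach 60% accuracy on these labels, which is why F1 is the more telling metric here. If this is what is happening, scaling the features or trying `kernel="linear"` would be natural things to check.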

## kNN

Let me try kNN:

In [7]:
clf = KNeighborsClassifier()

In [10]:
scores = model_selection.cross_val_score(clf, train_vectors, train["target"], cv=3, scoring="f1")
scores

array([0.19700552, 0.10376569, 0.20187793])
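
Averaging the three fold scores makes the comparison between the models easier. Here is a quick sketch using the values printed above (the `fold_scores` dictionary is just a container introduced here):

```python
from statistics import mean

# F1 scores per fold, copied from the cross-validation outputs above.
fold_scores = {
    "ridge": [0.59421842, 0.56455572, 0.64149093],
    "svm": [0.0, 0.0, 0.0],
    "knn": [0.19700552, 0.10376569, 0.20187793],
}

mean_scores = {name: mean(scores) for name, scores in fold_scores.items()}

# Print the models from best to worst mean F1.
for name, score in sorted(mean_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:5s} mean F1 = {score:.3f}")
```

The ridge classifier averages about 0.60, against about 0.17 for kNN with its default settings.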

OK, the kNN seems to be doing much worse than the linear model. I won't be doing any fine-tuning on my laptop since that could take some time, but I will write the code that I'd use:

In [14]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
                "n_neighbors" : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                "weights" : ["uniform", "distance"],
                "algorithm" : ["auto", "ball_tree", "kd_tree", "brute"]
             }

randomized_search = RandomizedSearchCV(clf, param_grid)
randomized_search_results = randomized_search.fit(train_vectors, train["target"])



KeyboardInterrupt: 

Here's what a grid search would look like:

### Note: Since this searches every combination of the parameters, I'd run it with only the _n\_neighbors_ parameter.

In [15]:
from sklearn.model_selection import GridSearchCV

param_grid = {
                "n_neighbors" : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
             }
grid_search = GridSearchCV(clf, param_grid)
grid_search_results = grid_search.fit(train_vectors, train["target"])

KeyboardInterrupt: 
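
To see why these searches are slow, it helps to count how many model fits each one requests. The sketch below assumes the defaults of recent scikit-learn versions (5-fold cross-validation, and `n_iter=10` for `RandomizedSearchCV`); the variable names are illustrative.

```python
from itertools import product

# The larger parameter grid from the randomized search above.
param_grid = {
    "n_neighbors": list(range(1, 11)),
    "weights": ["uniform", "distance"],
    "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
}

# Assumed defaults (recent scikit-learn versions): cv=5, n_iter=10.
cv_folds = 5
n_iter = 10

n_combinations = len(list(product(*param_grid.values())))

full_grid_fits = n_combinations * cv_folds  # exhaustive search over all three parameters
randomized_fits = n_iter * cv_folds  # randomized search samples n_iter settings
neighbors_only_fits = len(param_grid["n_neighbors"]) * cv_folds  # grid over n_neighbors only
```

An exhaustive search over all three parameters would need 400 fits, whereas restricting the grid to `n_neighbors` needs 50, the same number as the randomized search.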

## Neural network

Let me try a neural network now.

First, let me look at the dimensions of the input vector:

In [17]:
train_vectors.shape

(7613, 21637)

In [19]:
model = keras.Sequential(
    [
        layers.Dense(32, activation="relu", input_shape=(21637, )),
        layers.Dense(64, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid")
    ]
)

In [20]:
model.compile(optimizer=keras.optimizers.RMSprop(), loss="binary_crossentropy", metrics=["accuracy"])

In [23]:
model.fit(train_vectors.todense(), train["target"], epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f39a0df4c10>

### Note: I could have tried 1D convolutions here, but ~96% training accuracy seems good enough. I'm also not sure 1D convolutions would be a good choice for this task, and I had a bug related to input dimensions with an RNN (below) that I'd have to fix before using 1D convolutions.

Let me now predict using this neural network and submit the solution:

In [25]:
predictions = model.predict(test_vectors.todense())

In [26]:
predictions

array([[0.90038764],
       [0.9743723 ],
       [0.9975591 ],
       ...,
       [0.99737537],
       [0.9910486 ],
       [0.9966301 ]], dtype=float32)

Let's convert the predictions to classes:

In [30]:
predictied_classes = [1 if prediction > 0.5 else 0 for prediction in predictions] # 0.5 is my threshold
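
The same thresholding can be done in one vectorized step with NumPy. Here is a small sketch on made-up probabilities (`example_predictions` is hypothetical and only mimics the `(n, 1)` shape that `model.predict` returns):

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n_samples, 1).
example_predictions = np.array([[0.90], [0.12], [0.51], [0.49]], dtype=np.float32)

threshold = 0.5
# Compare every probability to the threshold, cast to int labels, and flatten to 1-D.
example_classes = (example_predictions > threshold).astype(int).ravel()
```

The result is a flat integer array that can be assigned directly to a DataFrame column.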

In [31]:
predictied_classes

[1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,


In [32]:
sample_submission = pd.read_csv(os.path.join(DATASET_DIR, "sample_submission.csv"))

In [33]:
sample_submission["target"] = predictied_classes

In [34]:
sample_submission.head()

| | id | target |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 9 | 1 |
| 4 | 11 | 1 |


In [35]:
sample_submission.to_csv("submission.csv", index=False)

## RNN

Let's try an RNN with LSTM layers.

For this we first have to reshape the input to the model:

In [6]:
train_vectors.shape

(7613, 21637)

In [10]:
train_vectors_reshaped = train_vectors.todense().reshape((7613, 1, 21637)) # LSTM expects dimensions that are (samples, time_steps, features)

In [11]:
train_vectors_reshaped

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [15]:
train_vectors_reshaped.shape

(7613, 21637)

OK, let's represent the features as time steps to see if that helps:

In [16]:
train_vectors_reshaped = train_vectors.todense().reshape((7613, 21637, 1)) # LSTM expects dimensions that are (samples, time_steps, features) (https://machinelearningmastery.com/reshape-input-data-long-short-term-memory-networks-keras/)

In [17]:
train_vectors_reshaped

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [18]:
train_vectors_reshaped.shape

(7613, 21637)

No, it doesn't help...

### TODO: Figure out how to reshape train_vectors so that they fit into LSTM
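
A likely cause of the unchanged shape is that `.todense()` returns a `numpy.matrix`, which is always two-dimensional, so reshaping it to three dimensions collapses back to 2-D. Converting to a plain `ndarray` first (or calling `.toarray()` on the sparse matrix instead) should let the reshape go through. Here is a minimal sketch on a small stand-in (`small_matrix` is hypothetical and built with `np.asmatrix` to mimic what `.todense()` returns):

```python
import numpy as np

n_samples, n_features = 4, 6

# Stand-in for train_vectors.todense(): a numpy.matrix, which is always 2-D.
small_matrix = np.asmatrix(np.zeros((n_samples, n_features), dtype=np.int64))

# Reshaping the matrix to 3-D silently drops the length-1 axis again.
still_2d = small_matrix.reshape((n_samples, 1, n_features))

# Converting to a plain ndarray first makes the 3-D reshape stick.
as_array = np.asarray(small_matrix)
reshaped = as_array.reshape((n_samples, 1, n_features))  # (samples, time_steps, features)

print(still_2d.shape, reshaped.shape)
```

With a 3-D array of shape `(samples, 1, features)`, the input would match the `input_shape=(1, 21637)` declared in the model below. The full dense array is large (7613 × 21637 entries), so memory may become the next constraint.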

### Note: I'm not sure if RNNs are a good choice for this task.

In [12]:
model = keras.Sequential(
    [
        layers.LSTM(32, input_shape=(1, 21637), return_sequences=True),
        layers.LSTM(64, return_sequences=True),
        layers.LSTM(128, return_sequences=True),
        layers.Dense(1, activation="sigmoid")
    ]
)

In [13]:
model.compile(optimizer=keras.optimizers.RMSprop(), loss="binary_crossentropy", metrics=["accuracy"])

In [14]:
model.fit(train_vectors_reshaped, train["target"], epochs=5)

Epoch 1/5


ValueError: in user code:

    /home/mislav/anaconda3/envs/kraken/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:805 train_function  *
        return step_function(self, iterator)
    /home/mislav/anaconda3/envs/kraken/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:795 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    /home/mislav/anaconda3/envs/kraken/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:1259 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /home/mislav/anaconda3/envs/kraken/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:2730 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /home/mislav/anaconda3/envs/kraken/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:3417 _call_for_each_replica
        return fn(*args, **kwargs)
    /home/mislav/anaconda3/envs/kraken/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:788 run_step  **
        outputs = model.train_step(data)
    /home/mislav/anaconda3/envs/kraken/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:754 train_step
        y_pred = self(x, training=True)
    /home/mislav/anaconda3/envs/kraken/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py:998 __call__
        input_spec.assert_input_compatibility(self.input_spec, inputs, self.name)
    /home/mislav/anaconda3/envs/kraken/lib/python3.7/site-packages/tensorflow/python/keras/engine/input_spec.py:223 assert_input_compatibility
        str(tuple(shape)))

    ValueError: Input 0 of layer sequential is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: (None, 21637)


## TODOs

 - RNN (fix input dimensions bug)
 - Conv1D (try using this layer)
 - BERT + some classifiers (first some package / import issues have to be fixed)