In [71]:
import numpy as np
import pandas as pd
from sklearn import feature_extraction, linear_model, model_selection, preprocessing


In [72]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
train_df.shape, test_df.shape

((7613, 5), (3263, 4))

In [73]:
# A quick look at the data 
train_df[train_df['target']==1]['text'].values[0]

'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'

In [74]:
train_df.head(10)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


In [75]:
'''
BUILDING VECTORS

The theory behind the model is that the words in each tweet contains a good indicator
of wheather the post is about a real disaster or not (Well its not entirely correct
but its a good place to start). 

'CountVectorizer' will be used from 'scikit-learn' library for each tweet and 
turn them to data our ML model can process

'CountVectorizer converts collection of text documents into matrix of token counts.

Following are the steps performed by CountVectorizer:
------------------------------------------------------
1. It splits the input text into individual words or tokens
2. Converts all texts into lowercase letters
3. Removes punctuations
4. Creates a Document Term Metrix(DTM) -> Each document is represented as a vector
where each dimention correosponds to a word(or token) in the vocabulary. 
5. Constructs a vocabulary of all the unique words found in the text corpus
6. Transforms text into numerical features which can be used by the ML algorithm.
'''

count_vectorizer = feature_extraction.text.CountVectorizer()

To center the data (make it have zero mean and unit standard error), you subtract the mean and then divide the result by the standard deviation:

x′ = (x−μ)/σ

You do that on the training set of the data. But then you have to apply the same transformation to your test set (e.g. in cross-validation), or to newly obtained examples before forecasting. But you have to use the exact same two parameters μ
 and σ
 (values) that you used for centering the training set.

Hence, every scikit-learn's transform's fit() just calculates the parameters (e.g. μ
 and σ
 in case of StandardScaler) and saves them as an internal object's state. Afterwards, you can call its transform() method to apply the transformation to any particular set of examples.

fit_transform() joins these two steps and is used for the initial fitting of parameters on the training set x
, while also returning the transformed x′
. Internally, the transformer object just calls first fit() and then transform() on the same data.

In [76]:
# Lets get counts for first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df['text'][0:5])

In [77]:
# .todense() is used because the vectors will be "sparse" vectors as data is 
# very less for the vectorizer to form a dense vectors
print(example_train_vectors.todense().shape)
print(example_train_vectors.todense())

(5, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 2 0 1 1 1 0 0 0 0 1 1 0 0 0 0 0 2 0 0 0 0 0 1 1 0 1 1 1 1 0 0
  0 2 0 0 0 1 0 0 0 0 0 2 0 0 0 1 0 0]
 [1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1
  0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
  1 0 1 0 0 0 0 1 0 1 1 0 1 0 1 0 0 1]]


The above output shows that:
1. There a are 54 unique tokens in the 5 tweets
2. The first tweets contains only some of those unique tokens - all of the non-zeor counts above are the tokens that Do exist in the tweet.

In [78]:
# Creating vectors for all our tweets

train_vectors = count_vectorizer.fit_transform(train_df["text"])

''' 
Note that we are not using fit_transform() method here. Using just transform() to
make sure that the train vectors are the only ones mapped to the test vectors
i.e., the train and test vectors use the same set of tokens.

1. fit_transform() learns the vocabulary and transforms the training data into 
document-term matrix
2. if fit_transform is used on test data also it wwould learn a new vocabolary
based on the test data, potentialy including tokens not present in the training
data. This would result in a different set of tokens for the test data cousing 
inconsistencies. So, only transform is usded for test data.
'''
test_vectors = count_vectorizer.transform(test_df['text'])

In [79]:
# Ridge (L2) Regression
'''
Ridge regression aims for fitting the data well and keeping the coefficients small,
i.e., to create a balance between bias and varience. This is done by adding a penalty
term known as 'regularization parameter' or 'ridge parameter'

The ridge regression adds a penalty term lambda to the ordinary list square cost function
of normal regression:

COST FUNCTION = SUM OF SQUARED ERROR + LAMBDA X SUM OF SQUARE OF THE COEFFICIENTS

For ridge classification(Binary Classification) a linear line or a hyperplane in higher dimensions is
used to seperate different classes. 

For ridge classifier(Multi-class classification) one-vs-rest approach is taken, for ex:
if there are 3 classes we train three seperate ridge classifiers
-> One classify to distinguish class A from class B and C.
-> One Classify to distinguish class B from class A and C.
-> One classify to distinguish class C from class A and B.
During prediction the class that gives the highest confidence determines the class.
Or One-vs-One approach is used: A ridge classifier is trained for each payer of classes.
During prediction, each classifier votes for a class and the most votes win.
'''

clf = linear_model.RidgeClassifier()

In [80]:
scores = model_selection.cross_val_score(clf, train_vectors, 
                                         train_df['target'], cv=3, 
                                         scoring='f1')
scores

array([0.59453669, 0.5642787 , 0.64082434])

In [81]:
# The above score is not the best as the highest f1 score is .64
# To improve this we can implement TFIDF, LSA, RNN/LSTMs etc.

# Runnning predictions on training data and build submission for the competition.

clf.fit(train_vectors, train_df['target'])

In [82]:
# out_df = pd.DataFrame()

# out_df['id'] = train_df['id']
# out_df['target'] = clf.predict(test_vectors)

# columns_to_export = out_df[['id','target']]

# out_df.head()

sample_submission = pd.read_csv("sample_submission.csv")
sample_submission["target"] = clf.predict(test_vectors)
sample_submission.head(20)

Unnamed: 0,id,target
0,0,0
1,2,1
2,3,1
3,9,0
4,11,1
5,12,1
6,21,0
7,22,0
8,27,0
9,29,0


### With CountVectorizer and Ridge Classifier F1 score is : 0.64

# Using TF-IDF

In [83]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import linear_model, model_selection
from sklearn.metrics import f1_score
import pandas as pd

In [84]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
train_df.shape, test_df.shape

((7613, 5), (3263, 4))

In [85]:
# Initialize the TF-IDF vectorizer
tfidf = TfidfVectorizer()

In [86]:
# TF-IDF vector should fit only on the training data and learn the vocabulary and term frequencies
# Once the vectorizer is fitted then we use transform both the train and test data into feature vectors based on the learned vocabulary 
train_vectors = tfidf.fit_transform(train_df['text'])
test_vectors = tfidf.transform(test_df['text'])

In [87]:
# Initialize the ridge classifier
rc = linear_model.RidgeClassifier()

In [88]:
# Checking F1 score using cross validaiton
score = model_selection.cross_val_score(rc, train_vectors,
                                        train_df['target'], cv=3,
                                        scoring='f1')
score

array([0.63366337, 0.6122449 , 0.68407835])

In [89]:
# Fit the entire ridge regressor on the training data
rc.fit(train_vectors, train_df['target'])

In [90]:
# Predict on test data
predictins = rc.predict(test_vectors)

In [91]:
print(predictins[:5])

[1 1 1 0 1]


### With TF-IDF and Ridge Clssifier the F1 Score is: 0.68

# Using RNN/LSTM

In [92]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.callbacks import Callback
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [93]:
# print(f"Tensorflow running on {len(tf.config.list_physical_devices('GPU'))} gpus")

In [94]:
# print(f"Tensorflow running on {len(tf.config.list_logical_devices('GPU'))} logical devices")

In [95]:
# print(tf.test.is_built_with_cuda())

In [96]:
# !nvidia-smi

In [97]:
# Introduce tokenization and padding
tokenizer = Tokenizer(num_words=10000, oov_token='<OOV>')
tokenizer.fit_on_texts(test_df['text'])

train_sequence = tokenizer.texts_to_sequences(train_df['text'])
test_sequence = tokenizer.texts_to_sequences(test_df['text'])

max_length = max(len(seq) for seq in train_sequence)
train_padded = pad_sequences(train_sequence, padding='post', maxlen=max_length)
test_padded = pad_sequences(test_sequence, padding='post', maxlen=max_length)

In [98]:
# Applying train test split
X_train, X_val, y_train, y_val = train_test_split(train_padded, train_df['target'], test_size=0.2, random_state=50)

In [99]:
# Defining the LSTM model

model = Sequential(
    [Embedding(input_dim=10000, output_dim=64, input_length=max_length),
    LSTM(64, return_sequences=False),
    Dense(1, activation='sigmoid')]
)



In [100]:
# Compiling the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [101]:
# Custom callback for f1 score
class F1ScoreCallback(Callback):
    def __init__(self, val_data, val_labels):
        super().__init__()
        self.val_data = val_data
        self.val_labels = val_labels

    def on_epoch_end(self, epoch, logs=None):
        # Predict on validation data
        val_predictions = self.model.predict(self.val_data)
        val_predictions = (val_predictions > 0.5).astype(int).flatten()
        f1_val = f1_score(self.val_labels, val_predictions)

        # Print the f1 score
        print(f"Epoch {epoch+1} -> f1 socre: {f1_val:0.4f}")

In [102]:
# Create an instance for the F1CoreCallback class
f1_callback = F1ScoreCallback(val_data=X_val, val_labels=y_val)

In [103]:
# Performing the training sequence
return_data = model.fit(train_padded, 
                        train_df['target'], 
                        epochs=50, 
                        batch_size=32, 
                        validation_data=(X_val, y_val),
                        callbacks=[f1_callback])

Epoch 1/50
[1m48/48[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step p - accuracy: 0.6270 - loss: 0
Epoch 1 -> f1 socre: 0.8325
[1m238/238[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 8ms/step - accuracy: 0.6287 - loss: 0.6292 - val_accuracy: 0.8647 - val_loss: 0.3491
Epoch 2/50
[1m48/48[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step p - accuracy: 0.8491 - loss: 0.3
Epoch 2 -> f1 socre: 0.8599
[1m238/238[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 7ms/step - accuracy: 0.8490 - loss: 0.3725 - val_accuracy: 0.8851 - val_loss: 0.3052
Epoch 3/50
[1m48/48[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step p - accuracy: 0.8789 - loss: 0.3
Epoch 3 -> f1 socre: 0.8856
[1m238/238[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 7ms/step - accuracy: 0.8788 - loss: 0.3183 - val_accuracy: 0.8949 - val_loss: 0.2680
Epoch 4/50
[1m48/48[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step p - accuracy: 0.9014 - loss: 0

In [104]:
# Evaluating model on test set 
predictions = model.predict(test_padded)
predictions = (predictions > 0.5).astype(int)

[1m102/102[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step


In [105]:
sample_submission = pd.read_csv('sample_submission.csv')
sample_submission['target'] = predictions
sample_submission.head(20)

Unnamed: 0,id,target
0,0,1
1,2,1
2,3,1
3,9,1
4,11,1
5,12,1
6,21,0
7,22,0
8,27,0
9,29,0


In [106]:
# writting predictions to csv for submission
sample_submission.to_csv('sipan_submission.csv', index=False)

In [108]:
submission = pd.read_csv('sipan_submission.csv')
submission.shape

(3263, 2)

As per kaggle compition criterion, submission should be a CSV file with 3263 rows and a header

#### With above submission, kaggle score: 0.74655

## To improve the kaggle score will try to apply cross validation during training of the model