**Project Name : Real or Not? NLP with Disaster Tweets**

In this competition, we are challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren't.

Prize Money : 10 000$

Authors : Merhbene Oumaima AND Azzouz Myriam 

**References**

* Source for bert_encode function: https://www.kaggle.com/user123454321/bert-starter-inference
* All pre-trained BERT models from Tensorflow Hub: https://tfhub.dev/s?q=bert
* https://github.com/google-research/bert#learning-a-new-wordpiece-vocabulary

**Kernels that inspired us**
* https://www.kaggle.com/cabonfim10/basic-nlp-disaster-tweets
* https://www.kaggle.com/vbmokin/nlp-eda-bag-of-words-tf-idf-glove-bert
* https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub



This notebook presents two approaches. The first uses a feed-forward neural network on bag-of-words features; the second uses BERT (Bidirectional Encoder Representations from Transformers) through a pretrained model from TensorFlow Hub. BERT comes in two sizes, base and large; in our project we use the base model, BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters. In the conclusion we compare both approaches.

Plan:

1/ NLP using Neural Networks

* Data Visualization
* Data Preprocessing
* Model Building
* Submission

2/ NLP using BERT

* BERT input encoding
* BERT model

3/ Conclusion

---




**Library Load**


In [0]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
import string

from keras.models import Sequential
from keras.layers import Dense ,Activation, Dropout, BatchNormalization
from keras.optimizers import Adam

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


In [0]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

**Data Load**

In [0]:
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
sample_submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")

In [0]:
train_df.head(3)

In [0]:
test_df.head(3)

In [0]:
print('There are {} rows and {} columns in train'.format(train_df.shape[0],train_df.shape[1]))
print('There are {} rows and {} columns in test'.format(test_df.shape[0],test_df.shape[1]))






# 1/ NLP using Neural Networks 

* **Data Visualization:**



---





Let's do some exploration and visualization of the data. This will help us to gain some valuable insights about the dataset.

In [0]:
import seaborn as sns
from matplotlib import pyplot as plt

fig, axes = plt.subplots(ncols=2, figsize=(17, 4), dpi=100)

# The mean of a 0/1 target is the share of disaster tweets
disaster_share = train_df['target'].mean() * 100
labels = ['Disaster Tweet', 'No Disaster']
size = [disaster_share, 100 - disaster_share]
explode = (0, 0.1)
axes[0].pie(size, labels=labels, explode=explode, shadow=True,
            startangle=90, autopct='%1.1f%%')
sns.countplot(x=train_df['target'], hue=train_df['target'], ax=axes[1])
plt.tight_layout()  # adjust spacing once both plots are drawn
plt.show()

There are only two classes, 0 and 1.
The dataset is reasonably balanced, although tweets about disasters are fewer than those about no disaster.
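
The pie chart sizes rely on the fact that the mean of a 0/1 label equals the share of positive examples. Here is a minimal sketch of that arithmetic, using a small hypothetical label list rather than the real target column:

```python
# Hypothetical labels: 1 = disaster tweet, 0 = not a disaster
labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# The mean of a 0/1 column is the fraction of ones
disaster_share = sum(labels) / len(labels) * 100
no_disaster_share = 100 - disaster_share

print(f"disaster: {disaster_share:.1f}%, no disaster: {no_disaster_share:.1f}%")
```

On the real data the same computation is what train_df['target'].mean() provides.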

* **Data Preprocessing**


---



Tweets always have to be cleaned before we move on to modelling, so we will do some basic cleaning such as removing URLs, emojis, punctuation, etc.

Our end goal is to reduce each tweet to a sentence made only of keywords; we can then encode those words and try to find patterns in them later on.


***A- Cleaning Data :***

**1-Removing URLs and emojis:**

First, let's get rid of any website URLs and emojis in the tweets.

In [0]:
def remove_url(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
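
To see what these two patterns do, here is a small sketch on hypothetical tweets (not taken from the dataset). It reuses the same URL pattern and, for brevity, only the first emoji range (emoticons):

```python
import re

# Same URL pattern as in remove_url
url_pattern = re.compile(r'https?://\S+|www\.\S+')
# Only the "emoticons" range from remove_emoji, to keep the sketch short
emoticon_pattern = re.compile("[\U0001F600-\U0001F64F]+", flags=re.UNICODE)

# Hypothetical tweets
no_url = url_pattern.sub('', "Wildfire update http://t.co/abc123 stay safe")
no_emoji = emoticon_pattern.sub('', "So sad \U0001F622")

print(repr(no_url))
print(repr(no_emoji))
```

Note that removing a URL leaves a double space behind; the tokenization step later splits on runs of separators, so this does not create extra tokens.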

**2-Removing punctuation:**

Next we want to remove any punctuation. This is especially important because it also removes the '@' and '#' symbols that are so common on Twitter.

In [0]:
def remove_punc(text):
    return text.translate(str.maketrans('', '', string.punctuation))    
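
As a quick illustration, here is the same translate call applied to a hypothetical tweet. It is only a sketch to show the effect on mentions, hashtags and abbreviations:

```python
import string

# Hypothetical tweet
sample = "@user #earthquake in L.A.!!! Stay safe, everyone."

# Map every ASCII punctuation character to None (i.e. delete it)
table = str.maketrans('', '', string.punctuation)
cleaned = sample.translate(table)

print(cleaned)
```

The hashtag and mention keep their word but lose their symbol, and "L.A." collapses to "LA".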

**3-Tokenization:**

At this point we've removed a lot of the fluff that goes with tweets, so now let's tokenize them. Tokenization is a common form of lexical analysis that splits a text into its individual components (words, sentences, etc.); here we split each tweet into words.

In [0]:
def tokenization(text):
    text = re.split(r'\W+', text)
    return text
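
One detail worth knowing about this split: a separator at the start or end of the text produces an empty string in the result. A minimal sketch on a hypothetical cleaned tweet (the filtering step at the end is an optional addition, not part of the notebook's pipeline):

```python
import re

def tokenization(text):
    # Split on runs of non-word characters (spaces, leftover symbols)
    return re.split(r'\W+', text)

# Hypothetical cleaned tweet with a double space and a trailing space
tokens = tokenization("forest  fire near la ronge ")
print(tokens)

# Optional: drop the empty strings that appear at the edges
non_empty = [t for t in tokens if t]
print(non_empty)
```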

**4-Removing Stop words:**

The next step is to remove the stopwords. These are standard words like 'and', 'the', 'a' and so on. They are so prevalent in language that they'll cloud our data. We only want to deal with keywords, so we'll get rid of these stopwords.



In [0]:
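import nltk
from nltk.corpus import stopwords

# Fetch the stopword list if it is not already installed
nltk.download('stopwords', quiet=True)
# A set gives fast membership tests
stopword = set(stopwords.words('english'))

def remove_stopwords(text):
    # text is a list of tokens; keep only those that are not stopwords
    text = [word for word in text if word not in stopword]
    return text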

Now it's time to apply these functions to our dataset. The result will be a much cleaner set of tweets, which will help a lot when we try to build a model later on.



In [0]:
for datas in [train_df,test_df]:
    datas['text'] = datas['text'].apply(lambda x : remove_url(x))
    datas['text'] = datas['text'].apply(lambda x : remove_emoji(x))
    datas['text'] = datas['text'].apply(lambda x : remove_punc(x))
    datas['text'] = datas['text'].apply(lambda x : tokenization(x.lower()))
    datas['text'] = datas['text'].apply(lambda x : remove_stopwords(x))
    datas['text'] = datas['text'].apply(lambda x : ' '.join(x))
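
To make the effect of the whole pipeline concrete, here is a compact sketch of the same steps on one hypothetical tweet. It assumes a tiny hand-written stopword set instead of NLTK's English list, skips the emoji step, and additionally drops empty tokens, so it is an illustration rather than the exact code above:

```python
import re
import string

# Small hypothetical stopword list; the notebook uses NLTK's English list
STOPWORDS = {"the", "a", "and", "is", "in", "of", "to"}

def clean(text):
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                 # remove URLs
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    tokens = re.split(r'\W+', text.lower())                           # tokenize
    tokens = [t for t in tokens if t and t not in STOPWORDS]          # remove stopwords
    return ' '.join(tokens)

# Hypothetical tweet
sample = "The fire in #California is spreading! http://t.co/xyz"
cleaned = clean(sample)
print(cleaned)
```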

**B-Tweets Encoding**





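Encoding is a way to turn our tweets into numbers. One simple way is to vectorise them: CountVectorizer builds a vocabulary from the training tweets and represents each tweet as a vector of word counts (a bag of words). This representation ignores word order and does not capture similarity between different words, but it allows our model to find patterns in which words appear.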

In [0]:
count_vec = CountVectorizer()
train_vec = count_vec.fit_transform(train_df['text'])
test_vec = count_vec.transform(test_df['text'])

In [0]:
print(train_vec.shape)
print(train_vec.toarray())
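
The idea behind count vectors can be sketched in a few lines of plain Python. This is a conceptual illustration on two hypothetical documents, not scikit-learn's implementation: the vocabulary is the sorted set of words, and each document becomes a row of counts, one column per word.

```python
from collections import Counter

# Hypothetical cleaned tweets
docs = ["forest fire near town", "fire fire everywhere"]

# Vocabulary: sorted unique words, each mapped to a column index
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d.split()}))}

def to_counts(doc):
    # Count each word, then read the counts in vocabulary order;
    # words outside the vocabulary are simply ignored
    counts = Counter(doc.split())
    return [counts.get(w, 0) for w in vocab]

matrix = [to_counts(d) for d in docs]
print(list(vocab))
print(matrix)
```

With the real data the matrix has one row per tweet and one column per vocabulary word, which is why train_vec is stored as a sparse matrix.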



*   **Model Building:**




---




Next we will build the actual model. We will use a neural network, a key tool of machine learning. To reduce overfitting, we will use the Dropout regularization technique. In addition, to improve speed, performance and stability, we will use batch normalization.


In [0]:

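model = Sequential()

# Input width = vocabulary size of the count vectors (17667 in the authors' run)
model.add(Dense(1024, activation='relu', input_shape=(train_vec.shape[1],)))
model.add(BatchNormalization())
model.add(Dropout(0.5))
# Only the first layer needs an input shape; later layers infer it
model.add(Dense(512, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
# Single sigmoid unit: probability that the tweet is about a real disaster
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()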
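
The parameter counts reported by model.summary() can be checked by hand. The sketch below assumes an input width of 17667 (the vocabulary size in the authors' run) and uses two hypothetical helper functions; a Dense layer has one weight per input/output pair plus one bias per output, and Keras BatchNormalization keeps four values per feature (gamma, beta, moving mean, moving variance).

```python
def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    # gamma, beta, moving mean, moving variance
    return 4 * n_features

n_features = 17667  # assumed input width (vocabulary size in the authors' run)

layer_params = [
    dense_params(n_features, 1024),
    batchnorm_params(1024),
    dense_params(1024, 512),
    batchnorm_params(512),
    dense_params(512, 1),
]
total = sum(layer_params)
print(layer_params, total)
```

Almost all of the parameters sit in the first Dense layer, which is one reason Dropout is useful here.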




In [0]:
x_train = train_vec
y_train = train_df["target"]
x_test = test_vec

# Hold out the first 1000 tweets for validation and train on the rest
x_val = x_train[:1000]
partial_x_train = x_train[1000:]
y_val = y_train[:1000]
partial_y_train = y_train[1000:]

history = model.fit(partial_x_train, partial_y_train, verbose=1, epochs=20,
                    batch_size=32, validation_data=(x_val, y_val))






*   **Submission:** 


---






Finally we take our predictions and upload them to the competition. The final score was 0.779, which isn't too bad considering the naivety of this implementation. 

In [0]:
sample_submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")
test_pred = model.predict(x_test)
# predict returns an (n, 1) array of probabilities; round to 0/1 and flatten to 1-D
sample_submission['target'] = test_pred.round().astype(int).ravel()
sample_submission.head()


In [0]:
sample_submission.to_csv("submission.csv", index=False)

# 2/ NLP using BERT

BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.


In [0]:
!pip install bert-tensorflow

Official tokenization script created by the Google team

In [0]:
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

In [0]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub
import tokenization

BERT uses three embeddings to compute the input representations. They are:

* token embeddings
* segment embeddings
* position embeddings

[CLS] is the reserved token that marks the start of a sequence, while [SEP] separates segments (or sentences).
[CLS] is the first token of every sequence. It is a classification token whose final hidden state is normally fed to a classification layer (such as softmax) for classification tasks. For anything else, it can be safely ignored.

In the 'bert_encode' function, max_len=512 means that each tweet is encoded as at most 512 WordPiece tokens, including [CLS] and [SEP].
512 is the maximum, but we can use a shorter length where possible, for memory and speed reasons.

In [0]:
def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
        # Truncate to leave room for the two special tokens
        text = text[:max_len-2]
        # "[CLS]" is the reserved token that marks the start of the sequence
        # "[SEP]" separates segments (or sentences)
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        # Padding: bring all inputs to the same length so BERT can process them
        tokens += [0] * pad_len
        # The mask is 1 for real tokens and 0 for padding
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        # Single-segment input: every position belongs to segment 0
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
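
To see the three outputs on a concrete input, here is a small sketch of the same encoding logic for a single text. ToyTokenizer and encode_one are hypothetical names: the toy tokenizer is a stand-in for FullTokenizer with a whitespace split and a made-up six-entry vocabulary, so the ids differ from those of the real BERT vocabulary.

```python
class ToyTokenizer:
    """Hypothetical stand-in for FullTokenizer: whitespace split, tiny vocab."""
    vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "forest": 4, "fire": 5}

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(t, self.vocab["[UNK]"]) for t in tokens]

def encode_one(text, tokenizer, max_len):
    # Truncate, add special tokens, then pad ids and mask up to max_len
    tokens = tokenizer.tokenize(text)[:max_len - 2]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)
    ids = tokenizer.convert_tokens_to_ids(sequence) + [0] * pad_len
    mask = [1] * len(sequence) + [0] * pad_len
    segments = [0] * max_len
    return ids, mask, segments

ids, mask, segments = encode_one("Forest fire", ToyTokenizer(), max_len=6)
print(ids)       # [2, 4, 5, 3, 0, 0]
print(mask)      # [1, 1, 1, 1, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0]
```

Every encoded tweet has exactly max_len entries, which is what lets the results be stacked into rectangular arrays.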

Once pre-trained, the BERT model can be fine-tuned for several tasks, ranging from binary text classification to question answering.

We take sequence_output[:, 0, :] to select, for every tweet in the batch, the hidden vector at the first position, i.e. the [CLS] classification token, across all hidden dimensions.

Finally, we apply a sigmoid activation in order to return a number between 0 and 1.
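
The slicing and the sigmoid head can be sketched in plain Python. All numbers below are hypothetical: a batch of 2 tweets, max_len of 3 and a hidden size of 2, with made-up Dense weights. The list comprehension is the nested-list equivalent of sequence_output[:, 0, :].

```python
import math

# Hypothetical sequence_output: batch of 2 tweets, max_len=3, hidden_dim=2
sequence_output = [
    [[0.5, -1.0], [0.1, 0.2], [0.0, 0.0]],   # tweet 1: position 0 is [CLS]
    [[-2.0, 0.3], [0.4, 0.4], [0.9, -0.1]],  # tweet 2
]

# Plain-Python equivalent of sequence_output[:, 0, :]
clf_output = [tweet[0] for tweet in sequence_output]

# Hypothetical weights and bias of the Dense(1) layer
weights, bias = [1.0, 0.5], 0.0

def sigmoid(z):
    # Squash any real number into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

probs = [sigmoid(sum(w * h for w, h in zip(weights, vec)) + bias)
         for vec in clf_output]
print(clf_output)
print(probs)
```

Each probability is later rounded, so values above 0.5 become class 1 (disaster) and values below become class 0.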

In [0]:
def build_model(bert_layer, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")
    # The hub layer returns (pooled_output, sequence_output)
    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    # sequence_output shape is (batch_size, max_len, hidden_dim);
    # keep the hidden vector at position 0, i.e. the [CLS] token
    clf_output = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(clf_output)
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(Adam(learning_rate=2e-5), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

### Load BERT from TFHub

* Load BERT from the Tensorflow Hub


In [0]:
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)



* Load tokenizer from the bert layer


A vocab file (vocab.txt) is used to map each WordPiece token to its id.

In [0]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)
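
The tokenizer splits each word into WordPiece sub-tokens with a greedy longest-match-first search against the vocabulary, marking continuation pieces with "##". Here is a minimal sketch of that search; the function name wordpiece and the tiny vocabulary are hypothetical, and the real vocab.txt is far larger.

```python
# Tiny hypothetical vocabulary; real ids come from vocab.txt
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "earth": 4, "##quake": 5, "flood": 6, "##ing": 7, "fire": 8}

def wordpiece(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry a ## prefix
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("earthquake", vocab))  # ['earth', '##quake']
print(wordpiece("flooding", vocab))    # ['flood', '##ing']
print(wordpiece("tsunami", vocab))     # ['[UNK]']
```

This is why max_len counts WordPiece tokens rather than words: a single rare word can expand into several pieces.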

* Encode the text into tokens, masks, and segment flags

In [0]:
train_input = bert_encode(train_df.text.values, tokenizer, max_len=160)
test_input = bert_encode(test_df.text.values, tokenizer, max_len=160)
train_labels = train_df.target.values

In [0]:
model = build_model(bert_layer, max_len=160)
model.summary()


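No validation data is passed to fit here, so the checkpoint monitors the training loss. With a validation split (for example validation_split=0.2 in fit), it could monitor val_loss instead.

In [0]:
# Keep the weights from the epoch with the lowest training loss
checkpoint = ModelCheckpoint('model.h5', monitor='loss', save_best_only=True)

train_history = model.fit(
    train_input, train_labels,
    epochs=3,
    callbacks=[checkpoint],
    batch_size=64
)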

In [0]:
test_pred = model.predict(test_input)

In [0]:
# Round the (n, 1) probabilities to 0/1 and flatten to a 1-D column
sample_submission['target'] = test_pred.round().astype(int).ravel()
sample_submission.to_csv('submission2.csv', index=False)

# 3/ Conclusion 

In conclusion, the neural network model gives us a score of 0.779, which ranked us 927th, while the BERT approach gives a score of 0.833, and our current ranking is 293/1338.
The official BERT paper from Google and the BERT tutorials mentioned at the top of this notebook helped us a lot to get into transfer learning for NLP, which saves training time and improves efficiency.