## Real or Not? NLP with Disaster Tweets

Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.


#### Data Description
id - a unique identifier for each tweet <br>
text - the text of the tweet<br>
location - the location the tweet was sent from (may be blank)<br>
keyword - a particular keyword from the tweet (may be blank)<br>
target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)<br>

In [None]:
#Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
#Importing the train and test datasets into respective dataframes with read_csv function
df = pd.read_csv('../input/nlp-getting-started/train.csv')
df_test = pd.read_csv("../input/nlp-getting-started/test.csv")

In [None]:
#To display maximum column width
pd.set_option('display.max_colwidth',None)

## DATA UNDERSTANDING

In [None]:
#Observing the first few records of the train dataset
df.head(10)

In [None]:
#Checking the dimension of the train and test datasets
print(df.shape)
print(df_test.shape)

In [None]:
#Basic information on the dataframes
df.info()

The `info()` output shows 7613 records in the train dataset, and some columns contain NULL values.<br>
Also, the columns keyword, location and text are of object type.<br>
Next, we can count the NULL values in each column.

In [None]:
#NULL count in train and test
print(df.isnull().sum())
print(df_test.isnull().sum())

Location has many missing values in both train and test. The text column, which is the one we are interested in, has no NULL values.
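
Raw NULL counts are easier to compare across train and test as percentages. The following is a minimal sketch on a made-up toy frame (the values are not from the competition data), assuming only pandas:

```python
import pandas as pd

# Toy frame standing in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "keyword": ["fire", None, "flood", "quake"],
    "location": [None, None, "USA", None],
    "text": ["a", "b", "c", "d"],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct)
```

The same expression should work unchanged on `df` and `df_test`.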

In [None]:
#Initial statistics for the columns
#describe(include='all') also covers the object columns, which gives us some pointers on duplicate values.
df.describe(include='all')

The unique count of the text column is 7503, against 7613 records, which means some tweets are duplicated. We can address this at a later point.<br>
Onto EDA!!

## EXPLORATORY DATA ANALYSIS

In [None]:
df.head()

### Balanced or Imbalanced?

In [None]:
#count for each category target
#Target 1 stands for real disaster and 0 stands for any other tweets
classes = df['target'].value_counts()
classes

In [None]:
#Creating a simple dataframe with percentage of each class
class_0 = classes[0]/df['target'].count()*100
class_1 = classes[1]/df['target'].count()*100
dist_df = pd.DataFrame({'Percentage':[class_0,class_1]},index=['Normal_Tweets','Disaster_Tweets'])
dist_df.style.background_gradient(cmap='coolwarm')

In [None]:
#Barplot for the classes
plt.title("Percentage of Tweet Classes",fontweight='bold')
sns.barplot(x=dist_df.index,y=dist_df['Percentage'],palette='Blues')
plt.show()

We can see the train dataset is almost balanced, with a roughly equal proportion of tweets from each class.

We know from the data description that `keyword` is a particular keyword from the tweet. We can inspect this column to get a sense of what the tweets mention, starting with the most frequent keywords.

In [None]:
df['keyword'].value_counts()[:10]

Some or all of these words look like real disaster mentions. We can create separate dataframes for disaster and non-disaster tweets and then look at the most frequently mentioned keywords in each.

In [None]:
df_not_disaster = df.loc[df['target']==0]
df_disaster = df.loc[df['target']==1]

In [None]:
#Top 10 keywords in the disaster tweets
df_disaster['keyword'].value_counts()[:10]

In [None]:
#Top 10 keywords in the other tweets
df_not_disaster['keyword'].value_counts()[:10]

In [None]:
#Barplots for the above
#Interpretation is always easier with the plots
plt.figure(figsize=(16,5))
plt.subplot(121)
plt.xlabel('Mentions')
plt.title('Top 10 keywords - DISASTER',fontweight='bold')
sns.barplot(y=df_disaster['keyword'].value_counts()[:10].index,x=df_disaster['keyword'].value_counts()[:10])
plt.subplot(122)
plt.xlabel('Mentions')
plt.title('Top 10 keywords - OTHER',fontweight='bold')
sns.barplot(y=df_not_disaster['keyword'].value_counts()[:10].index,x=df_not_disaster['keyword'].value_counts()[:10])
plt.show()

An interesting observation here is that even tweets not labelled as disasters use keywords like fear, siren and explode. Maybe these words are used in a sarcastic or funny context.

We can now look at the locations from which these tweets were posted.

In [None]:
#Top 10 locations based on the count
df['location'].value_counts()[:10]

Most of the tweets in the dataset are from the USA. Interestingly, there are some issues with this column: the United States appears under several different spellings, and some cities are listed separately rather than being added to the tally of their country.<br> We have also seen a large number of missing values in this column.<br> Since our primary goal is around the text field, we will not clean this column here, although it would otherwise need handling.
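
If the location column were to be cleaned, one option is an alias table that folds spelling variants and cities into a country label. This is only a sketch: `LOCATION_ALIASES` and `normalise_location` are hypothetical names, and the table entries are invented for illustration rather than derived from the data.

```python
import pandas as pd

# Hypothetical alias table; a real one would have to be built by inspecting the column.
LOCATION_ALIASES = {
    "usa": "USA",
    "united states": "USA",
    "us": "USA",
    "new york": "USA",
}

def normalise_location(value):
    """Map a raw location string to a country label, leaving unknown values as-is."""
    if pd.isna(value):
        return value
    key = value.strip().lower()
    return LOCATION_ALIASES.get(key, value.strip())

# Made-up sample values, including a missing one.
toy_locations = pd.Series(["USA", "United States", "New York", None, "London"])
cleaned = toy_locations.apply(normalise_location)
location_counts = cleaned.value_counts()
print(location_counts)
```

Missing values pass through untouched, so they still show up in `isnull()` checks afterwards.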

#### TWEET LENGTH - DISASTER AND NORMAL

We can now see some of the characteristics related to the tweet length

In [None]:
#Maximum Tweet Length
df['text'].str.len().max()

In [None]:
#Tweet with maximum length present in the train set
df.loc[df['text'].str.len()==df['text'].str.len().max(),'text']

In [None]:
#Minimum Tweet length
df['text'].str.len().min()

In [None]:
df.loc[df['text'].str.len()==df['text'].str.len().min()]

In [None]:
#Average tweet length
df['text'].str.len().mean()

The average Tweet length seems to be around 101 characters.

In [None]:
#Average word count per tweet in our train corpus
df['text'].str.split().apply(lambda x: len(x)).mean()

In [None]:
#Average word count per tweet for the separate Disaster and other sets
wl_not_disaster = df_not_disaster['text'].str.split().apply(lambda x: len(x))
wl_disaster = df_disaster['text'].str.split().apply(lambda x: len(x))
print(wl_not_disaster.mean())
print(wl_disaster.mean())

In [None]:
#Tweets with maximum word count in our train set
df.loc[df['text'].str.split().apply(lambda x: len(x))==df['text'].str.split().apply(lambda x: len(x)).max()]

In [None]:
#Distribution of the per-tweet word count for each class
#histplot replaces the deprecated distplot (needs seaborn 0.11 or later)
plt.figure(figsize=(15,5))
plt.subplot(121)
plt.title("Word count-Other Tweets",fontweight='bold')
sns.histplot(wl_not_disaster,kde=True,color='grey')
plt.xlabel('count')
plt.subplot(122)
plt.title("Word count-Disaster Tweets",fontweight='bold')
sns.histplot(wl_disaster,kde=True)
plt.xlabel('count')
plt.show()

## LEXICAL PROCESSING

As we said earlier, we are interested in the text column, which contains the actual tweets. There may be cleanliness issues in the text: attached links, hashtags, and other issues which need to be addressed. We can look for these now.

In [None]:
#Trying to find if there are tweets present with a web link.
df.loc[df['text'].str.contains('http')]

Oops, that's a huge number - 3971 tweets in our train set contain links!

In [None]:
#Trying to find if there are hashtags alongside
df.loc[df['text'].str.contains('#')]

Now we can build a text corpus with all the words present in these tweets. For this, we pick up each tweet in the dataset, break it down into words and append them to a single list.

In [None]:
def corpus_build(column):
    """Function to create a corpus list for all the words present in the tweets.Pass in the 
    dataframe column"""
    text_corpus = []
    for i in column.str.split():
        for word in i:
            text_corpus.append(word)
    return text_corpus

In [None]:
#Text_corpus for the train dataset tweets
text_corpus = corpus_build(df['text'])
#Text corpus for the test dataset tweets
text_corpus_test = corpus_build(df_test['text'])

In [None]:
#Total Number of words present in the tweets
len(text_corpus)

Keep in mind that the same word can be used in many different tweets. Also, the tweets may contain many stopwords, which don't add any value to the actual meaning of a sentence and act only as sentence builders/connectors.

In [None]:
#Importing the stopwords
from nltk.corpus import stopwords

In [None]:
#We can see the stopwords listed with this.
print(stopwords.words('english'))

As you can observe, these words need to be removed since they don't add any value. Before removing them from our tweets, we can run a quick check of how often they occur.

In [None]:
#Count of stopwords present in our tweets
#The stopword list is built once as a set; calling stopwords.words() inside the loop is very slow
stop_words = set(stopwords.words('english'))
corpus_stopwords = {}
for word in text_corpus:
    if word in stop_words:
        corpus_stopwords[word] = corpus_stopwords.get(word, 0) + 1
corpus_stopwords
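
The counting loop above can also be written with `collections.Counter` from the standard library. A minimal sketch, assuming a small hand-picked stopword set (`toy_stopwords`) in place of NLTK's English list so that it runs without any downloads:

```python
from collections import Counter

# Small hand-picked stopword set for illustration; the notebook uses NLTK's English list.
toy_stopwords = {"the", "a", "in", "of", "is"}
toy_corpus = "the fire in the forest is out of control".split()

# Count only the tokens that are stopwords.
stopword_counts = Counter(w for w in toy_corpus if w in toy_stopwords)

# most_common(n) returns the n highest counts, already sorted.
top_two = stopword_counts.most_common(2)
print(top_two)
```

`most_common` would also replace the separate sorting and slicing steps that follow.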

In [None]:
#Sorting these words by frequency, highest first, to find the top 10 most frequent stopwords
corpus_stopwords_sorted = sorted(corpus_stopwords.items(),key=lambda x:x[1],reverse=True)

In [None]:
corpus_stopwords_10 = corpus_stopwords_sorted[:10]
corpus_stopwords_10

In [None]:
top_corpus_stopwords = pd.DataFrame(corpus_stopwords_10,columns=["Word","Frequency"])
top_corpus_stopwords.style.background_gradient(cmap='Blues')

In [None]:
#Treemap for the corpus top stopwords
fig = px.treemap(top_corpus_stopwords,path=['Word'],values='Frequency',title="Top 10 stopwords in the corpus")
fig.show()

In [None]:
keys = []
values = []
for i in corpus_stopwords_10:
    keys.append(i[0])
    values.append(i[1])

In [None]:
#Plotting the top appearing stopwords and their corresponding frequency
plt.title("Top appearing STOPWORDS",fontweight='bold')
plt.bar(keys,values,color='grey')
plt.show()

In [None]:
# The top appearing stopwords in our corpus are -
keys

NOW BACK TO BUSINESS!!!

We now have a basic idea of what needs to be taken care of as part of data cleaning or text preprocessing - 
1. Removing the Website Links present within the tweets
2. Removing the hash from hashtags and any other special characters present in the tweets
3. Converting all text to a single case - either Lower or Upper
4. Removing Line Breaks if any
5. Tokenization
6. Handling the stopwords
7. Stemming/Lemmatization
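
Steps 1 to 6 can be sketched end to end on a single string before applying them to the dataframes. This is an illustrative stand-in, not the exact code used below: `clean_tweet` and `TOY_STOPWORDS` are hypothetical names, whitespace splitting replaces the NLTK tokenizer, and step 7 is left out because it needs NLTK data.

```python
import re
import string

# Tiny illustrative stopword set; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "is", "a", "in", "at", "of"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text):
    """Apply steps 1-6 to one tweet and return a list of tokens."""
    text = URL_PATTERN.sub("", text)      # 1. drop links
    text = text.translate(PUNCT_TABLE)    # 2. drop '#' and other ASCII punctuation
    text = text.lower()                   # 3. single case
    text = text.replace("\n", " ")        # 4. line breaks become spaces
    tokens = text.split()                 # 5. whitespace tokenization (stand-in)
    return [t for t in tokens if t not in TOY_STOPWORDS]  # 6. stopword removal

# Made-up example tweet containing a hashtag, a line break and a link.
example = "Forest #fire near La Ronge\nSask. is spreading http://t.co/abc"
tokens = clean_tweet(example)
print(tokens)
```

Removing links first matters, because stripping punctuation beforehand would break the URL pattern.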

### 1.Removing the Website Links

In [None]:
#Example tweet with link present. We will test the function on this
link ="Link to Regex basics - https://www.w3schools.com/python/python_regex.asp"

In [None]:
#Importing the regular expression function
import re

In [None]:
#Function to remove the links in the text
def remove_url(text):
    """Function to remove the URLs present in the text. Feed in the text data as input to function"""
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)

In [None]:
remove_url(link)

In [None]:
#Some of the records with urls before our process
df.iloc[31:34]

In [None]:
df['text'] = df['text'].apply(remove_url)
df_test['text'] = df_test['text'].apply(remove_url)

In [None]:
#After url removal
df.iloc[31:34]

### 2. Removing Hashes and other punctuation

In [None]:
#Importing the String module
import string

In [None]:
#Python provides a constant called string.punctuation that provides a great list of punctuation characters. 
print(string.punctuation)

In [None]:
def remove_punctuation(input1):
    """To remove all the punctuations present in the text. Input the text to the function"""
    table = str.maketrans('','',string.punctuation)
    return input1.translate(table)
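
One thing to be aware of: deleting punctuation outright glues together words that were separated only by punctuation. The sketch below contrasts deletion with mapping each punctuation character to a space; the example sentence is made up, and whether the second variant suits this dataset better is an assumption that would need checking.

```python
import string

text = "flood-flooding in #Texas...stay safe!"

# Deleting punctuation (the approach above) glues neighbours together.
delete_table = str.maketrans("", "", string.punctuation)
deleted = text.translate(delete_table)

# Mapping each punctuation character to a space keeps word boundaries;
# split/join then collapses the repeated spaces.
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = " ".join(text.translate(space_table).split())

print(deleted)
print(spaced)
```

The trade-off is that contractions such as "don't" are split into two tokens by the space variant.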

In [None]:
#Some of the records with hash before our process
df.iloc[3:6]

In [None]:
df['text'] = df['text'].apply(remove_punctuation)
df_test['text'] = df_test['text'].apply(remove_punctuation)

In [None]:
#After Punctuation removal
df.iloc[3:6]

### 3. Converting all text to Lower

In [None]:
#Converting text column to all lowercase
df['text'] = df['text'].str.lower()
df_test['text'] = df_test['text'].str.lower()

### 4. Removing Line Breaks

In [None]:
df.loc[df['text'].str.contains("\n")][:5]

We can see that there are line breaks present in several tweets, which we will remove.

In [None]:
def remove_linebreaks(input1):
    """Function to remove the line breaks  present in the text. Feed in the text data as input to function"""
    text = re.compile(r'\n')
    return text.sub(r' ',input1)

In [None]:
df['text'] = df['text'].apply(remove_linebreaks)
df_test['text'] = df_test['text'].apply(remove_linebreaks)

### 5. Tokenization

Tokenization is a technique used to split the text into smaller elements, which can be words, sentences or even paragraphs.<br>
Rather than using the split() function to tokenize these tweets into words, we can use one of the nltk tokenizers.<br>
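
To see why a dedicated tokenizer can differ from `split()`, here is a small sketch using a regular expression as a stand-in. The pattern is an illustrative assumption and is not the algorithm NLTK uses; it only shows punctuation being separated from words.

```python
import re

sentence = "Evacuate now! Roads closed, don't travel."

# str.split() only breaks on whitespace, so punctuation stays attached.
split_tokens = sentence.split()

# A simple regex tokenizer (illustrative stand-in, not NLTK's algorithm)
# separates runs of word characters from individual punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(split_tokens)
print(regex_tokens)
```

Since punctuation has already been removed from our tweets at this point, the two approaches will mostly agree on this dataset.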

In [None]:
#Importing the word_tokenize
from nltk.tokenize import word_tokenize

In [None]:
#We can tokenize all the tweets using word_tokenize
df['text'] = df['text'].apply(word_tokenize)
df_test['text'] = df_test['text'].apply(word_tokenize)

In [None]:
df.head()

### 6. Stopwords Removal

In [None]:
#Building the stopword set once, so that the list is not reloaded for every word
stop_words = set(stopwords.words('english'))

def remove_stopwords(input1):
    """Function to remove the stopwords present in the text. Feed in the tokenized text as input to function"""
    return [word for word in input1 if word not in stop_words]

In [None]:
df['text']=df['text'].apply(remove_stopwords)
df_test['text'] = df_test['text'].apply(remove_stopwords)

In [None]:
df.head(10)

If you observe the tweets now, we have a cleaner version, with each tweet represented by just its key words. <br>
But you can also observe some other issues. <br>
For example, see the 6th row: flood and flooding are variations of a single word, which can be handled by techniques like Stemming and Lemmatization.<br>
It is better to make sure that different variations of a word are represented by a single word, so that the word keeps its due importance. <br> These methods are part of token normalisation.

### 7. Stemming and Lemmatization

Stemming and Lemmatization are similar, but there are some differences between them.<br>
Stemming is a rule-based technique which chops off the suffix of a word to get its root form, which need not be a valid word.<br>
Popular Stemmers -> `Porter Stemmer` and `Snowball Stemmer`<br>
Lemmatization looks the word up in a dictionary (such as WordNet) to find its base form, or lemma.<br>
Popular Lemmatizer -> `WordNet Lemmatizer`

We will observe the difference between these and select the appropriate one for this scenario.
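
Before running the NLTK versions, the difference can be illustrated with a deliberately tiny toy. `toy_stem` and `toy_lemmatize` are hypothetical helpers written for this sketch; real stemmers use far richer rule sets, and WordNet is a large lexical database rather than a four-entry dictionary.

```python
# Toy illustration only, not how Porter, Snowball or WordNet actually work.
SUFFIXES = ("ing", "ed", "es", "s")
TOY_LEMMAS = {"flooding": "flood", "floods": "flood", "studies": "study", "feet": "foot"}

def toy_stem(word):
    """Strip the first matching suffix, without checking the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary of known forms; unknown words pass through."""
    return TOY_LEMMAS.get(word, word)

words = ["flooding", "floods", "studies", "feet"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The rule-based version produces the non-word "studi" and misses the irregular plural "feet", while the dictionary lookup handles both but only for words it knows.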

In [None]:
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()

In [None]:
def lemma_wordnet(input1):
    """Lemmatization function"""
    return [lem.lemmatize(w) for w in input1]

In [None]:
df['text'].apply(lemma_wordnet)[:10]

In [None]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [None]:
def stemming_porter(input1):
    """Stemming using Porter Stemmer"""
    return [stemmer.stem(w) for w in input1]

In [None]:
df['text'].apply(stemming_porter)[:10]

In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer_snowball = SnowballStemmer("english")

In [None]:
def stemming_snowball(input1):
    """Stemming using Snowball Stemmer"""
    return [stemmer_snowball.stem(w) for w in input1]

In [None]:
df['text'].apply(stemming_snowball)[:10]

Comparing all three, lemmatization looks like the better choice in our case, so we will go with it.

In [None]:
df['text'] = df['text'].apply(lemma_wordnet)
df_test['text'] = df_test['text'].apply(lemma_wordnet)

In [None]:
df.head()

In [None]:
def combine_text(input1):
    """Function to combine the list words"""
    combined = ' '.join(input1)
    return combined

In [None]:
df['text'] = df['text'].apply(combine_text)
df_test['text'] = df_test['text'].apply(combine_text)

In [None]:
df.head()

## BAG OF WORDS VS TF-IDF

Since our data is in text format, we need to convert it into a numeric representation in order to feed it to our ML models. Both `Bag of Words` and `TF-IDF` are ways of accomplishing this task.<br>
Bag of Words represents each document or sentence as a separate row and every word in the complete set of documents as a column. Each cell holds either a 0/1 flag for the presence of the word in the document or the frequency of the word in the document, which builds up a matrix representation. <br>
The problem with the bag of words model is that it treats every word as equally important, which is not always the case. Some words matter more than others, depending on how often they are used in a single document and how common they are across all documents. This is taken care of by the TF-IDF (Term Frequency-Inverse Document Frequency) representation.<br>
TF-IDF gives more weight to words which are frequent in a document but rare across the other documents.

#### BAG OF WORDS REPRESENTATION

We will use CountVectorizer for this. This converts a collection of text documents to a matrix of token counts.

In [None]:
#Importing the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
#Object for the CountVectorizer function
vectorizer = CountVectorizer()
bow_model_train = vectorizer.fit_transform(df['text'])
bow_model_test = vectorizer.transform(df_test['text'])

In [None]:
#Dense array view of the sparse matrix (mostly zeros)
bow_model_train.toarray()

#### TF-IDF REPRESENTATION

<b>Term Frequency = (Frequency of a term in the document) / (Total terms in that document) <br>
Inverse Document Frequency = log10(Total number of documents / Number of documents that have the term) </b>
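
The two formulas can be worked through by hand on a toy corpus. The sketch below follows the textbook definitions given here; the three documents and the helper names are made up for illustration. Note that scikit-learn's `TfidfVectorizer` uses a different variant by default (natural logarithm, smoothed document counts, an added 1, and L2 normalisation of each row), so its numbers will not match these.

```python
import math
from collections import Counter

# Made-up toy corpus of three short documents.
docs = [
    "forest fire near town",
    "fire crew saves town",
    "new song is fire",
]
tokenized = [d.split() for d in docs]

def term_frequency(term, tokens):
    """Occurrences of the term divided by the total number of terms in the document."""
    return Counter(tokens)[term] / len(tokens)

def inverse_document_frequency(term, corpus):
    """log10(total documents / documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log10(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

# "fire" appears in every document, "forest" only in the first one.
fire_score = tf_idf("fire", tokenized[0], tokenized)
forest_score = tf_idf("forest", tokenized[0], tokenized)
print(fire_score, forest_score)
```

Both words occur once in the first document, yet "fire" scores zero because it appears everywhere, while "forest" keeps a positive weight.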

In [None]:
#Importing TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer_tfidf = TfidfVectorizer()
tfidf_model_train = vectorizer_tfidf.fit_transform(df['text'])
tfidf_model_test = vectorizer_tfidf.transform(df_test['text'])

In [None]:
#Dense array view of the sparse matrix (mostly zeros)
tfidf_model_train.toarray()

## BIGRAMS

In [None]:
#CountVectorizer with ngram_range=(2,2) will give us bigrams. We will fit_transform our text column with this.
vectorizer_bigram = CountVectorizer(ngram_range=(2,2),analyzer='word')
sparse_matrix = vectorizer_bigram.fit_transform(df['text'])

In [None]:
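#We are creating here a dataframe for the bigrams which shows the frequency of each bigram
#Column sums of the sparse matrix give the corpus-wide count of each bigram
frequencies = np.asarray(sparse_matrix.sum(axis=0)).ravel()
#get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
bigram_df = pd.DataFrame(frequencies,index=vectorizer_bigram.get_feature_names_out(),columns=['frequency'])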

In [None]:
#Sorting the bigram dataframe based on the frequency
bigram_df.sort_values(['frequency'],axis=0,ascending=False,inplace=True)

In [None]:
#Top bigrams from our train tweets
bigram_df[:10].style.background_gradient(cmap='Purples')

In [None]:
bigram_df.reset_index(inplace=True)

In [None]:
bigram_df_top20 = bigram_df[:20]

In [None]:
fig = px.treemap(bigram_df_top20,path=['index'],values='frequency',title='Tree of most occurring Bigrams')
fig.show()

* We can find quite a few bigrams which actually denote a disaster, such as `suicide bomber`, `burning building`, `mass murder` etc. <br>
We can also find other frequently mentioned bigrams such as `youtube video`, `prebreak best` etc.<br>
Also, we can see 41 mentions of the bigram `Northern California`, which could mean either that many of the tweets come from this location or that the data itself was collected mostly from this location.
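
For intuition, bigram counting can be reproduced in a few lines of plain Python. This is a simplified sketch on made-up tweets: the `bigrams` helper is hypothetical and uses whitespace tokens, whereas `CountVectorizer` applies its own token pattern.

```python
from collections import Counter

# Made-up cleaned tweets for illustration.
toy_tweets = [
    "suicide bomber kills many",
    "suicide bomber detonated",
    "burning building collapsed",
]

def bigrams(text):
    """Return adjacent word pairs from a whitespace-tokenized string."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Pool the bigrams of all tweets and count them.
bigram_counts = Counter(bg for tweet in toy_tweets for bg in bigrams(tweet))
most_common_bigram = bigram_counts.most_common(1)[0]
print(most_common_bigram)
```

Pairs never span two tweets, because each tweet is split into bigrams on its own.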

## XGBOOST

In [None]:
#Importing the xgboost
import xgboost as xgb

In [None]:
#Setting the hyperparameters for the xgb model
xgb_param = xgb.XGBClassifier(max_depth=5, n_estimators=300, colsample_bytree=0.8, 
                                subsample=0.8, nthread=10, learning_rate=0.1)

In [None]:
#Importing the model_selection
from sklearn import model_selection

In [None]:
#Cross Validation scores with XGBoost model and bag of words representation
scores = model_selection.cross_val_score(xgb_param, bow_model_train, df["target"], cv=5, scoring="f1")
scores

In [None]:
#Cross Validation scores with XGBoost model and TF-IDF representation
scores = model_selection.cross_val_score(xgb_param, tfidf_model_train, df["target"], cv=5, scoring="f1")
scores

We can see that our F1 scores are really low (only around 53%) with the XGBoost model. Hence we can try other models as well.
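
As a reminder of what the `f1` scoring measures, here is a small worked sketch. The confusion counts are made up and are not results from this notebook, and `f1_from_counts` is a hypothetical helper.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall.

    Assumes at least one true positive, so no denominator is zero.
    """
    precision = tp / (tp + fp)   # share of predicted disasters that were real
    recall = tp / (tp + fn)      # share of real disasters that were found
    return 2 * precision * recall / (precision + recall)

# Made-up confusion counts for the disaster class, purely for illustration.
score = f1_from_counts(tp=60, fp=20, fn=40)
print(round(score, 3))
```

Because it is a harmonic mean, F1 is pulled towards the lower of precision and recall, so a model cannot score well by doing well on only one of them.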

## NAIVE BAYES

In [None]:
#Importing
from sklearn.naive_bayes import MultinomialNB

In [None]:
mnb = MultinomialNB()

In [None]:
scores = model_selection.cross_val_score(mnb, bow_model_train, df["target"], cv=5, scoring="f1")
scores

In [None]:
scores = model_selection.cross_val_score(mnb, tfidf_model_train, df["target"], cv=5, scoring="f1")
scores

We are getting a better score with the Naive Bayes model on the TF-IDF representation. Hence we will fit it on the full train set and predict on the test set.
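
One caveat about the cross-validation scores: the vectorizers were fitted on the whole train set before the folds were split, so each validation fold contributed to the vocabulary and IDF weights. The effect is probably small here, but a pipeline avoids it. The sketch below runs on made-up toy tweets and labels, so its scores say nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up, already-cleaned tweets and labels, only to keep the sketch runnable.
toy_texts = [
    "forest fire spreading fast", "earthquake hit city", "flood destroyed homes",
    "wildfire evacuation ordered", "storm damage reported", "bomb blast injured many",
    "love new song fire", "traffic jam total disaster", "exam bomb today",
    "party flood memes", "movie explode box office", "team crushed rival",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer sits inside the pipeline, so it is re-fitted on the training
# part of every fold and never sees that fold's validation tweets.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
cv_scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=3, scoring="f1")
print(cv_scores)
```

On the real data, the same pipeline could be passed `df['text']` and `df['target']`, then fitted once on the full train set before predicting on `df_test['text']`.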

In [None]:
mnb.fit(tfidf_model_train,df["target"])

In [None]:
df_submission = pd.read_csv("../input/nlp-getting-started/sample_submission.csv")

In [None]:
df_submission.shape

In [None]:
df_submission['target'] = mnb.predict(tfidf_model_test)

In [None]:
df_submission.loc[df_submission['target']==1].shape[0]

In [None]:
df_submission.to_csv("submission.csv",index=False)


