# Real or Not? NLP with Disaster Tweets
<br>
<img src='https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTbmtImbdYE8HEt_GzzxuWvTAXcNTzdk-vC0q3q5wtzbWniXvQG' alt='twitter' style='float:left' width=100%>
<div style='clear:both'></div>
<hr>
**Welcome all 😊**<br>

In this kernel we will explore the **Disaster Tweets** data together to learn how to use basic natural language processing (**NLP**) techniques.<br>

This kernel is divided into the following parts:<br>
<ol>
    <li><b>Data Exploration</b></li>
    <li><b>Data Preprocessing</b></li>
    <li><b>Basic NLP Techniques</b></li>
    <li><b>Model Building</b></li>
    <li><b>Model Evaluation</b></li>
</ol>
Now we will import libraries and load our data.

In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import f1_score

import warnings
warnings.filterwarnings('ignore')

print("Important libraries loaded successfully")

# 1. Data Exploration

In [None]:
data_train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
print("Data shape = ",data_train.shape)
data_train.head()

As the table above shows, we have some missing values, so let's deal with them.

# 2. Data Preprocessing
Data preprocessing is one of the most important steps in any data science or machine learning project, so let's start.
## 2.1 Missing Data

In [None]:
#count the missing values in each column
total = data_train.isnull().sum().sort_values(ascending=False)

#get the fraction of missing values in each column (missing count / number of rows)
percent = (data_train.isnull().sum()/data_train.isnull().count()).sort_values(ascending=False)

missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(data_train.shape[1])

As the table above shows, almost 33% of the **location** column is missing, and a very small percentage of the **keyword** column is missing.<br>
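
To make the calculation concrete, here is a minimal sketch on a tiny made-up table. The `toy` DataFrame and its values are illustrative assumptions, not competition data. It relies on the fact that the column-wise mean of the boolean mask returned by `isnull()` equals the missing count divided by the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: 4 rows, with some missing entries
toy = pd.DataFrame({
    "keyword":  ["fire", np.nan, "flood", "quake"],
    "location": [np.nan, "Paris", np.nan, "Cairo"],
    "text":     ["a", "b", "c", "d"],
})

# isnull() gives a True/False mask; its column-wise mean is the missing fraction
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)
```

In this toy table half of `location` and a quarter of `keyword` are missing, while `text` is complete.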

## 2.2 How to Handle Missing Data ?
Handling missing values is one of the most common problems in data analysis.<br>

I like to include this image in my kernels because it gives a roadmap for handling **missing data**.
<img src='https://miro.medium.com/max/1528/1*_RA3mCS30Pr0vUxbp25Yxw.png' width="550px" style='float:left;'>
<div style='clear:both'></div>
<br>

From the **Deletion** branch I will use the **Deleting Columns** technique, so we will now drop the **location** and **keyword** columns.

In [None]:
data_train = data_train.drop(['location','keyword'], axis=1)
print("location and keyword columns dropped successfully")

The **id** column is only a row identifier and tells us nothing about the content of a tweet, so we will drop it as well.

In [None]:
data_train = data_train.drop('id', axis=1)
print("id column dropped successfully")

Now we have only the text and target columns. Let's make sure.

In [None]:
data_train.columns

# 3. Basic NLP Techniques
<img src='https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcT6TkPpD8nWsbTVa9ExwfCQUnFmzkNE8zjZJ3uXSaBVd09ErhvZ' alt='text preprocessing' style='float:left' width=50% >
<div style='clear:both'></div>
<hr>
Before starting the text preprocessing steps we must know two terms: **Corpus and Bag of words.**<br>

**Corpus :** A large and structured set of texts. Here we can think of it as a simplified version of our text data that contains only the cleaned, useful text.<br>

**Bag of words :** In practice, the Bag-of-words model is mainly used as a tool of feature generation. After transforming the text into a "bag of words", we can calculate various measures to characterize the text [wikipedia](https://en.wikipedia.org/wiki/Bag-of-words_model)<br>

Now we will do the following steps to preprocess our text data
<ol>
    <li><b>Remove unwanted words</b></li>
    <li><b>Transform words to lowercase</b></li>
    <li><b>Remove stopwords</b></li>
    <li><b>Stemming words</b></li>
    <li><b>Create sparse matrix ( Bag of words )</b></li>
</ol>  
Now let's start by exploring our **text** column.

In [None]:
data_train["text"].head(10)

## 3.1 Remove unwanted words
As we can see, our **text** column contains unwanted characters such as **#, =>, numbers, etc.** These characters are not useful for our problem, so we will keep only the plain letters, without any markings or numbers.<br>

We will do this by **specifying** a pattern with the **re** library.
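
As a small illustration, the sketch below applies the same letters-only pattern to a made-up string (the `sample` text is an assumption, not a tweet from the data). Splitting and re-joining is added here only to collapse the repeated spaces that the substitution leaves behind.

```python
import re

sample = "Fire #1 => help!!"  # hypothetical tweet text

# Replace every character that is not an ASCII letter with a space,
# then collapse the repeated spaces by splitting and re-joining
letters_only = " ".join(re.sub("[^a-zA-Z]", " ", sample).split())
print(letters_only)
```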

## 3.2 Transform words to lowercase
We must transform words to lowercase because each character has its own **ASCII code** that represents it in the computer, and an uppercase letter has a different code than the same letter in lowercase. As a result, 'A' differs from 'a' as far as the computer is concerned, so "Fire" and "fire" would be counted as two different words.
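
A quick sketch of this point, using only built-in Python functions (the example word is arbitrary):

```python
# Uppercase and lowercase letters have different character codes
code_upper = ord("A")
code_lower = ord("a")
print(code_upper, code_lower)

# so the same word in a different case is a different string until we lowercase it
same_before = "Fire" == "fire"
same_after = "Fire".lower() == "fire".lower()
print(same_before, same_after)
```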

## 3.3 Remove stopwords
**Stop words :** are generally the most common words in a language (for example "the", "is" and "and"). They say little about whether a tweet describes a disaster, so we will remove them to avoid misleading our model.
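
The idea can be sketched with a tiny hand-picked stop-word set. The set `toy_stop_words` and the sentence are assumptions made for illustration; the kernel itself relies on NLTK's English stop-word list.

```python
# Hypothetical, hand-picked stop words; the kernel itself uses NLTK's English list
toy_stop_words = {"the", "is", "on", "a", "in"}

tokens = "the forest is on fire".split()

# Keep only the tokens that are not stop words
kept = [word for word in tokens if word not in toy_stop_words]
print(kept)
```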

## 3.4 Stemming words
**Stemming :** is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form [wikipedia](https://en.wikipedia.org/wiki/Stemming)<br>
We use stemming to reduce the dimensionality of the **bag of words**, because several forms of a word collapse into a single stem.
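
A minimal sketch of stemming with NLTK's `PorterStemmer`, the same stemmer imported at the top of this kernel. The word list is a made-up example.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = ["fire", "fires", "running", "flooded"]  # hypothetical example words

# Reduce each word to its stem
stems = [stemmer.stem(word) for word in words]
print(stems)
```

Here "fire" and "fires" end up as the same stem, so they would share one column in the bag of words instead of two.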

In [None]:
corpus  = []
pstem = PorterStemmer()
#Build the stop-word set once instead of rebuilding it for every word
stopWords = set(stopwords.words('english'))
for text in data_train['text']:
    #Remove unwanted characters (keep letters only)
    tweet = re.sub("[^a-zA-Z]", ' ', text)
    #Transform words to lowercase
    tweet = tweet.lower()
    tweet = tweet.split()
    #Remove stopwords, then stem the remaining words
    tweet = [pstem.stem(word) for word in tweet if word not in stopWords]
    tweet = ' '.join(tweet)
    #Append cleaned tweet to corpus
    corpus.append(tweet)
    
print("Corpus created successfully")    

**Let's explore corpus, and discover the difference between raw and clean text data**

In [None]:
print(pd.DataFrame(corpus)[0].head(10))

In [None]:
rawTexData = data_train["text"].head(10)
cleanTexData = pd.DataFrame(corpus, columns=['text after cleaning']).head(10)

frames = [rawTexData, cleanTexData]
result = pd.concat(frames, axis=1, sort=False)
result

Some words appear only a handful of times across all tweets, so we will remove them from our **Bag of words** to reduce its dimensionality as much as possible.<br>

We will do this by creating a dictionary where each **key** is a **word** and each **value** is the **frequency of that word across all tweets**.

In [None]:
#Create our dictionary 
uniqueWordFrequents = {}
for tweet in corpus:
    for word in tweet.split():
        if(word in uniqueWordFrequents.keys()):
            uniqueWordFrequents[word] += 1
        else:
            uniqueWordFrequents[word] = 1
            
#Convert dictionary to dataFrame
uniqueWordFrequents = pd.DataFrame.from_dict(uniqueWordFrequents,orient='index',columns=['Word Frequent'])
uniqueWordFrequents.sort_values(by=['Word Frequent'], inplace=True, ascending=False)
uniqueWordFrequents.head(10)

In [None]:
uniqueWordFrequents['Word Frequent'].unique()

As we can see, some words appear many times and others only rarely, so we will keep only the words that appear at least 20 times.
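
The counting and filtering idea can be sketched on a toy corpus with `collections.Counter`. The three "tweets" and the threshold of 2 are assumptions chosen so the result is easy to check by hand; the kernel uses a threshold of 20 on the real corpus.

```python
from collections import Counter

toy_corpus = ["fire forest fire", "flood street", "fire flood"]  # hypothetical cleaned tweets
min_count = 2  # hypothetical threshold; the kernel uses 20

# Count every word across all tweets
word_counts = Counter(word for tweet in toy_corpus for word in tweet.split())

# Keep only the words that reach the threshold
frequent = {word: n for word, n in word_counts.items() if n >= min_count}
print(frequent)
```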

In [None]:
uniqueWordFrequents = uniqueWordFrequents[uniqueWordFrequents['Word Frequent'] >= 20]
print(uniqueWordFrequents.shape)
uniqueWordFrequents

## 3.5 Create sparse matrix ( Bag of words )
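The **Bag of words** has one column per unique word kept from the corpus and one row per tweet; each cell holds the number of times that word appears in that tweet. We cap the number of columns at the number of frequent words found above.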
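
To see what such a matrix looks like, here is a small sketch with `CountVectorizer` on two made-up cleaned tweets (the `toy_corpus` is an assumption). The column order is read from the fitted `vocabulary_` mapping.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["fire forest fire", "flood street"]  # hypothetical cleaned tweets

vectorizer = CountVectorizer()
toy_bag = vectorizer.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each word to its column index
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(columns)
print(toy_bag)
```

Each row is one tweet and each column one word, so the first row records "fire" twice and "forest" once.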

In [None]:
counVec = CountVectorizer(max_features = uniqueWordFrequents.shape[0])
bagOfWords = counVec.fit_transform(corpus).toarray()

# 4. Model Building
Now we will build our models. We will use the following models:
* Decision Tree Model
* Gradient Boosting Model
* K - Nearest Neighbors Model
* Logistic Regression Model
* Stochastic Gradient Descent Model
* Support Vector Machine Model
* Bernoulli Naive Bayes Model
* Gaussian Naive Bayes Model
* Multinomial Naive Bayes Model
* Voting Classifier Model

But before using them, we will first split our data into training and test sets.
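
As a quick sanity check of what an 80/20 split does, the sketch below splits ten made-up samples. `toy_X` and `toy_y` are illustrative assumptions; the split arguments mirror the ones used in this kernel.

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_X = np.arange(20).reshape(10, 2)   # 10 hypothetical samples, 2 features each
toy_y = np.array([0, 1] * 5)           # hypothetical labels

# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.20,
                                          random_state=55, shuffle=True)
print(X_tr.shape, X_te.shape)
```

With ten samples, eight go to the training set and two are held out for testing.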

In [None]:
X = bagOfWords
y = data_train['target']
print("X shape = ",X.shape)
print("y shape = ",y.shape)

X_train , X_test , y_train , y_test = train_test_split(X,y,test_size=0.20, random_state=55, shuffle =True)
print('data split successfully')

## 4.1 Decision Tree Model

In [None]:
decisionTreeModel = DecisionTreeClassifier(criterion= 'entropy',
                                           max_depth = None, 
                                           splitter='best', 
                                           random_state=55)

decisionTreeModel.fit(X_train,y_train)

print("decision Tree Classifier model run successfully")

## 4.2 Gradient Boosting Model

In [None]:
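#The loss argument is left at its default (the logistic loss for classification),
#because the name 'deviance' is not accepted by recent scikit-learn releases
gradientBoostingModel = GradientBoostingClassifier(learning_rate = 0.01,
                                                   n_estimators = 100,
                                                   max_depth = 30,
                                                   random_state=55)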

gradientBoostingModel.fit(X_train,y_train)

print("gradient Boosting Classifier model run successfully")

## 4.3 K - Nearest Neighbors Model

In [None]:
KNeighborsModel = KNeighborsClassifier(n_neighbors = 7,
                                       weights = 'distance',
                                      algorithm = 'brute')

KNeighborsModel.fit(X_train,y_train)

print("KNeighbors Classifier model run successfully")

## 4.4 Logistic Regression Model

In [None]:
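#Note: this rebinds the name LogisticRegression from the imported class to the model instance,
#so re-running this cell fails unless the import cell is run again first
LogisticRegression = LogisticRegression(penalty='l2', 
                                        solver='saga', 
                                        random_state = 55)  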

LogisticRegression.fit(X_train,y_train)

print("LogisticRegression Classifier model run successfully")

## 4.5 Stochastic Gradient Descent Model

In [None]:
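#Note: this rebinds the name SGDClassifier from the imported class to the model instance,
#so re-running this cell fails unless the import cell is run again first
SGDClassifier = SGDClassifier(loss = 'hinge', 
                              penalty = 'l1',
                              learning_rate = 'optimal',
                              random_state = 55, 
                              max_iter=100)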

SGDClassifier.fit(X_train,y_train)

print("SGDClassifier Classifier model run successfully")

## 4.6 Support Vector Machine Model

In [None]:
SVClassifier = SVC(kernel= 'linear',
                   degree=3,
                   max_iter=10000,
                   C=2, 
                   random_state = 55)

SVClassifier.fit(X_train,y_train)

print("SVClassifier model run successfully")

## 4.7 Bernoulli Naive Bayes Model

In [None]:
bernoulliNBModel = BernoulliNB(alpha=0.1)
bernoulliNBModel.fit(X_train,y_train)

print("bernoulliNB model run successfully")

## 4.8 Gaussian Naive Bayes Model

In [None]:
gaussianNBModel = GaussianNB()
gaussianNBModel.fit(X_train,y_train)

print("gaussianNB model run successfully")

## 4.9 Multinomial Naive Bayes Model

In [None]:
multinomialNBModel = MultinomialNB(alpha=0.1)
multinomialNBModel.fit(X_train,y_train)

print("multinomialNB model run successfully")

## 4.10 Voting Classifier Model

In [None]:
modelsNames = [('LogisticRegression',LogisticRegression),
               ('SGDClassifier',SGDClassifier),
               ('SVClassifier',SVClassifier),
               ('bernoulliNBModel',bernoulliNBModel),
               ('multinomialNBModel',multinomialNBModel)]

votingClassifier = VotingClassifier(voting = 'hard',estimators= modelsNames)
votingClassifier.fit(X_train,y_train)
print("votingClassifier model run successfully")

# 5. Model Evaluation

Now we will evaluate our models using **f1_score**. Let's go.
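
For readers new to the metric, here is a minimal hand computation of the F1 score, the harmonic mean of precision and recall, on made-up labels. The two lists are assumptions for illustration; for binary labels like these, `f1_score` from scikit-learn should return the same value.

```python
y_true = [1, 0, 1, 1, 0, 0]  # hypothetical true labels
y_pred = [1, 0, 0, 1, 1, 0]  # hypothetical predictions

# Count true positives, false positives and false negatives for class 1
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)   # share of predicted disasters that are real
recall = tp / (tp + fn)      # share of real disasters that were found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```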

In [None]:
#evaluation details: score() returns the mean accuracy, f1_score() the F1 score on the test set
models = [decisionTreeModel, gradientBoostingModel, KNeighborsModel, LogisticRegression, 
          SGDClassifier, SVClassifier, bernoulliNBModel, gaussianNBModel, multinomialNBModel, votingClassifier]

for model in models:
    print(type(model).__name__,' Train Score is   : ' ,model.score(X_train, y_train))
    print(type(model).__name__,' Test Score is    : ' ,model.score(X_test, y_test))
    
    y_pred = model.predict(X_test)
    print(type(model).__name__,' F1 Score is      : ' ,f1_score(y_test,y_pred))
    print('--------------------------------------------------------------------------')

<p style='font-size:25px;font-weight:bold'>If you find this kernel useful, please upvote it to help others see it 😊</p>