**`Import the libraries`**

In [1]:
import numpy as np 
import pandas as pd 
import nltk
from nltk.corpus import stopwords
import string

**`Loading & Viewing dataset`**

In [2]:
df = pd.read_csv(r"./train.csv", usecols=["text", "target"])
df.head(10)

| | text | target |
|---|---|---|
| 0 | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | Just got sent this photo from Ruby #Alaska as ... | 1 |
| 5 | #RockyFire Update => California Hwy. 20 closed... | 1 |
| 6 | #flood #disaster Heavy rain causes flash flood... | 1 |
| 7 | I'm on top of the hill and I can see a fire in... | 1 |
| 8 | There's an emergency evacuation happening now ... | 1 |
| 9 | I'm afraid that the tornado is coming to our a... | 1 |


In [3]:
# Convert the column Index to a list first; "string" + Index would prefix every column name
print("col_names : \t" + str(list(df.columns)))
print('\n')
print("Data-dimensions: \t" + str(df.shape))
print('\n')
print("Count of non-null values for each feature: \n" + str(df.notnull().sum()))

col_names : 	['text', 'target']


Data-dimensions: 	(7613, 2)


Count of non-null values for each feature: 
text      7613
target    7613
dtype: int64


**`Checking for duplicates and removing them`**

In [4]:
df.drop_duplicates(inplace = True)
print("Shape after dropping duplicate rows:\t" + str(df.shape))

Shape after dropping duplicate rows:	(7521, 2)


`Create a function that cleans the text and returns its tokens. Cleaning has two steps: first remove punctuation, then remove stop words.`

**`Tokenization`** splits each text into a list of tokens; the cleaning function will later be passed to the vectorizer as its analyzer.

`1. Punctuation`: characters such as `` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``

`2. Stop words`: very common words (for example "the", "is", "and") that carry little meaning on their own and are usually dropped in natural language processing.

**`Viewing the punctuation characters in the string module`**

In [5]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

**`Build a function`**

In [6]:
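# Build the stop-word set once; calling stopwords.words() for every word is slow.
# Requires the NLTK corpus: run nltk.download('stopwords') once if it is missing.
STOP_WORDS = set(stopwords.words('english'))

def process_text(text):
    '''
    What will be covered:
    1. Remove punctuation
    2. Remove stopwords
    3. Return list of clean text words
    '''

    #1 Drop every punctuation character, then rebuild the string
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)

    #2 Keep only the words that are not English stop words (case-insensitive check)
    clean_words = [word for word in nopunc.split() if word.lower() not in STOP_WORDS]

    #3
    return clean_words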

In [7]:
df['text'].head(10).apply(process_text)

0    [Deeds, Reason, earthquake, May, ALLAH, Forgiv...
1        [Forest, fire, near, La, Ronge, Sask, Canada]
2    [residents, asked, shelter, place, notified, o...
3    [13000, people, receive, wildfires, evacuation...
4    [got, sent, photo, Ruby, Alaska, smoke, wildfi...
5    [RockyFire, Update, California, Hwy, 20, close...
6    [flood, disaster, Heavy, rain, causes, flash, ...
7                    [Im, top, hill, see, fire, woods]
8    [Theres, emergency, evacuation, happening, bui...
9                  [Im, afraid, tornado, coming, area]
Name: text, dtype: object

**`Remove links and markup (URLs, HTML tags)`**

In [8]:
import re

def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)


In [9]:
def remove_html(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'',text)
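
The two helpers above are defined but never demonstrated. The snippet below is a minimal sketch of how they might be combined; the sample string is made up for illustration, and the final whitespace-collapsing step is an extra assumption that is not part of the notebook. HTML tags are stripped first so that the URL pattern (`\S+`) cannot swallow a tag that directly follows a link.

```python
import re

# Same patterns as in the cells above, compiled once at module level
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
HTML_PATTERN = re.compile(r'<.*?>')

def remove_URL(text):
    return URL_PATTERN.sub('', text)

def remove_html(text):
    return HTML_PATTERN.sub('', text)

# Hypothetical tweet-like sample, not taken from train.csv
sample = '<a href="https://example.com">Flood</a> warning, see https://example.com/alert now'

# Strip tags first, then URLs, then collapse the leftover double spaces
cleaned = " ".join(remove_URL(remove_html(sample)).split())
print(cleaned)
```

To use the helpers on the dataset, they could be applied with `df['text'].apply(...)` before tokenization, although the notebook itself does not do so.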

**`Build a function to correct misspelled words`**

In [10]:
from spellchecker import SpellChecker
spell = SpellChecker()

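def correct_spellings(text):
    words = text.split()
    # unknown() returns the subset of words that are not in the dictionary
    misspelled_words = spell.unknown(words)
    corrected_text = []
    for word in words:
        if word in misspelled_words:
            # correction() can return None in recent pyspellchecker releases
            # when no candidate is found; keep the original word in that case
            corrected_text.append(spell.correction(word) or word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)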

df['text'].head(10).apply(correct_spellings).apply(process_text)

0    [Deeds, Reason, earthquake, May, ALLAH, Forgiv...
1        [Forest, fire, near, La, Ronge, Sask, Canada]
2    [residents, asked, shelter, place, notified, o...
3    [r3000, people, receive, wildfire, evacuation,...
4    [got, sent, photo, Ruby, Alaska, smoke, wildfi...
5    [RockyFire, Update, California, Hwy, 20, close...
6    [flood, disaster, Heavy, rain, causes, flash, ...
7                    [Im, top, hill, see, fire, woods]
8    [Theres, emergency, evacuation, happening, bui...
9                  [Im, afraid, tornado, coming, area]
Name: text, dtype: object

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
text_process = CountVectorizer(analyzer = process_text).fit_transform(df['text'])

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(text_process, df['target'], 
                                                    test_size = 0.20, 
                                                    stratify = df['target'], 
                                                    random_state = 42)

In [13]:
#Get the shape of text_process
text_process.shape

(7521, 26473)
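
To make the shape above easier to read, here is a small sketch of what `CountVectorizer` produces when it is given a custom analyzer: one row per document and one column per distinct token. The three documents and the tiny stop list are invented stand-ins (the notebook uses NLTK's English list). Because a callable analyzer replaces scikit-learn's built-in preprocessing, tokens are not lowercased, so `Forest` and `forest` become separate columns; this is one likely reason the real vocabulary is so large.

```python
import string
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in stop list (assumption); the notebook uses NLTK's English stop words
TOY_STOP_WORDS = {"the", "is", "a", "in", "near"}

def toy_analyzer(text):
    # Same two steps as process_text: strip punctuation, then drop stop words
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    return [w for w in no_punct.split() if w.lower() not in TOY_STOP_WORDS]

# Hypothetical documents, not taken from train.csv
docs = [
    "Forest fire near the lake!",
    "The fire is spreading.",
    "A quiet day in the forest",
]

vectorizer = CountVectorizer(analyzer=toy_analyzer)
counts = vectorizer.fit_transform(docs)

print(counts.shape)                    # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # the distinct tokens, one per column
print(counts.toarray())                # token counts per document
```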

In [14]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [15]:
#Print the predictions
print(classifier.predict(X_train))
print('\n')
#Print the actual values
print(y_train.values)

[1 1 1 ... 1 0 0]


[1 1 1 ... 1 0 0]


In [16]:
#Evaluate the model on the training data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

pred = classifier.predict(X_train)

print(classification_report(y_train, pred ))
print('Confusion Matrix: \n',confusion_matrix(y_train, pred))
print()
print('Accuracy: ', accuracy_score(y_train,pred))

              precision    recall  f1-score   support

           0       0.93      0.98      0.95      3452
           1       0.96      0.90      0.93      2564

    accuracy                           0.94      6016
   macro avg       0.95      0.94      0.94      6016
weighted avg       0.94      0.94      0.94      6016

Confusion Matrix: 
 [[3368   84]
 [ 268 2296]]

Accuracy:  0.9414893617021277


In [17]:
#Print the predictions
print('Predicted value: ',classifier.predict(X_test))
#Print Actual Label
print('Actual value: ',y_test.values)

Predicted value:  [1 1 1 ... 0 1 0]
Actual value:  [0 1 1 ... 0 0 0]


In [18]:
#Evaluate the model on the test data set
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score
pred = classifier.predict(X_test)
print(classification_report(y_test, pred))
print('Confusion Matrix: \n', confusion_matrix(y_test, pred))
print()
print('Accuracy: ', accuracy_score(y_test, pred))

              precision    recall  f1-score   support

           0       0.80      0.83      0.82       863
           1       0.76      0.73      0.74       642

    accuracy                           0.78      1505
   macro avg       0.78      0.78      0.78      1505
weighted avg       0.78      0.78      0.78      1505

Confusion Matrix: 
 [[714 149]
 [175 467]]

Accuracy:  0.7847176079734219
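
As a cross-check, both accuracy figures can be recomputed by hand from the confusion matrices printed above: accuracy is the sum of the diagonal divided by the total number of samples. The short sketch below does this in plain Python and reports the gap between training and test accuracy; the helper name `accuracy` is only illustrative.

```python
# Confusion matrices copied from the outputs above (rows: actual, columns: predicted)
train_cm = [[3368, 84], [268, 2296]]
test_cm = [[714, 149], [175, 467]]

def accuracy(cm):
    # Correct predictions sit on the diagonal
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total

train_acc = accuracy(train_cm)
test_acc = accuracy(test_cm)
gap = train_acc - test_acc

print(f"train={train_acc:.4f} test={test_acc:.4f} gap={gap:.4f}")
```

The gap of roughly 16 percentage points is what motivates the conclusion that follows.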


`The model reaches about 94% accuracy on the training set but only about 78% on the test set, so it is overfitting and needs to be adjusted.`
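
One possible way to reduce the overfitting, sketched here as an assumption and not as the notebook's own method, is to tune the smoothing parameter `alpha` of `MultinomialNB` with cross-validation. Putting the vectorizer inside a `Pipeline` also means the vocabulary is learned from the training folds only. The toy corpus, the parameter grid, and the use of the default analyzer are all illustrative choices; on the real data, `df['text']` and `df['target']` would take their place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = disaster, 0 = not a disaster
texts = [
    "forest fire spreading near town", "flood waters rising fast",
    "earthquake shakes the city", "tornado warning issued now",
    "wildfire evacuation ordered today", "heavy rain causes flash flood",
    "love this sunny day", "great coffee this morning",
    "watching a movie tonight", "my team won the game",
    "new phone arrived today", "dinner with friends was great",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Vectorizer and classifier are fitted together on each training fold
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Larger alpha means stronger smoothing, i.e. more regularization
param_grid = {"nb__alpha": [0.1, 0.5, 1.0, 2.0, 5.0]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print("Best alpha:", search.best_params_["nb__alpha"])
print("Cross-validated accuracy:", search.best_score_)
```

Whether this closes the gap on the real dataset is not something the notebook shows; other options, such as limiting the vocabulary size or lowercasing the tokens, may matter just as much.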