In this case study, you are given Twitter data collected from an anonymous Twitter handle. Using a Naïve Bayes model, predict whether a given tweet reports a real disaster (real) or not (fake).
1 = real tweet and 0 = fake tweet
Business Objectives - Ensure that the model identifies real disaster tweets as accurately as possible, to avoid missing critical information. Correctly identify fake tweets to minimize the impact of misinformation.
Maximize - Maximize the recall for real disaster tweets.
Minimize - Minimize the number of fake disaster tweets incorrectly classified as real.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
import re

In [3]:
tweet=pd.read_csv("D:/Documents/Datasets/Disaster_tweets_NB.csv")
tweet

,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
7608,10869,,,Two giant cranes holding a bridge collapse int...,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,,,Police investigating after an e-bike collided ...,1


In [5]:
tweet.head()

,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [7]:
tweet.tail()

,id,keyword,location,text,target
7608,10869,,,Two giant cranes holding a bridge collapse int...,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,,,Police investigating after an e-bike collided ...,1
7612,10873,,,The Latest: More Homes Razed by Northern Calif...,1


In [9]:
tweet.shape

(7613, 5)

In [11]:
tweet.dtypes

id           int64
keyword     object
location    object
text        object
target       int64
dtype: object

In [13]:
tweet.describe()

,id,target
count,7613.0,7613.0
mean,5441.934848,0.42966
std,3137.11609,0.49506
min,1.0,0.0
25%,2734.0,0.0
50%,5408.0,0.0
75%,8146.0,1.0
max,10873.0,1.0


In [15]:
tweet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [17]:
tweet.isnull().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

In [19]:
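#Clean the tweet text: keep letters only, lower-case it, and drop short words
def cleaning_text(i):
    #Replace every run of non-letter characters with a single space
    i=re.sub(r"[^A-Za-z]+"," ",i).lower()
    #Collect the words that are longer than 3 characters
    w=[]
    for word in i.split(" "):
        if len(word)>3:
            w.append(word)
    #Note: the words are joined with two spaces, so a later split(" ") also yields empty-string tokens
    return("  ".join(w))
#Check the function (a notebook cell displays only the value of its last expression)
cleaning_text("Hello what are you doing")
cleaning_text("Hi how are you ")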

''
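
The effect of the cleaning step is easier to see on a real tweet. The sketch below is a hypothetical variant, not the function used in this notebook: `clean_tweet` and its `min_len` parameter are names introduced here for illustration, and it joins words with a single space, so its output differs from `cleaning_text` in spacing only.

```python
import re

def clean_tweet(text, min_len=4):
    """Keep letters only, lower-case, and drop words shorter than min_len."""
    # Replace every run of non-letter characters with one space
    text = re.sub(r"[^A-Za-z]+", " ", text).lower()
    # split() with no argument ignores leading, trailing, and repeated spaces
    words = [w for w in text.split() if len(w) >= min_len]
    return " ".join(words)

example = clean_tweet("Forest fire near La Ronge Sask. Canada")
empty = clean_tweet("Hi how are you ")
print(repr(example))  # 'forest fire near ronge sask canada'
print(repr(empty))    # ''
```

A tweet made up only of short words collapses to an empty string, which is why empty rows are filtered out a few cells later.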

In [21]:
tweet.text=tweet.text.apply(cleaning_text)

In [23]:
tweet.drop(["id","keyword","location"],axis=1,inplace=True)

In [25]:
tweet=tweet.loc[tweet.text !="",:]

In [27]:
from sklearn.model_selection import train_test_split
tweet_train,tweet_test=train_test_split(tweet,test_size=0.2)
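
The split above is unseeded, so the accuracy figures that follow will vary from run to run. A minimal sketch of a reproducible, class-balanced split is shown below on a toy frame. The `toy` data and the seed value 42 are assumptions made for illustration; `stratify` and `random_state` are standard `train_test_split` parameters.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned tweet frame: 60 "fake" (0) and 40 "real" (1) rows
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(100)],
    "target": [0] * 60 + [1] * 40,
})

# stratify keeps the share of real tweets similar in both parts;
# random_state makes the split repeatable
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["target"], random_state=42
)

train_share = toy_train["target"].mean()
test_share = toy_test["target"].mean()
print(len(toy_train), len(toy_test), train_share, test_share)
```

Applied to this notebook, the same two arguments would be passed with `tweet` and `tweet.target`. The numbers reported in the later cells would then change, since they come from one particular random split.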

In [29]:
#Tokenize a cleaned tweet by splitting on single spaces
def split_into_words(i):
    return i.split(" ")

#Build the bag-of-words vocabulary
#Note: it is fitted on all tweets, train and test alike; fitting on tweet_train.text only would keep test vocabulary out of the model
tweet_bow=CountVectorizer(analyzer=split_into_words).fit(tweet.text)
#Transform the whole data set
all_tweet_matrix=tweet_bow.transform(tweet.text)
#Transform the training tweets
train_tweet_matrix=tweet_bow.transform(tweet_train.text)
#Transform the test tweets
test_tweet_matrix=tweet_bow.transform(tweet_test.text)

In [31]:
#Fit the TF-IDF transformer (note: fitted on the full matrix, so IDF weights also see the test tweets)
tfidf_transformer=TfidfTransformer().fit(all_tweet_matrix)
#Apply it to train_tweet_matrix
train_tfidf=tfidf_transformer.transform(train_tweet_matrix)
train_tfidf.shape
#Apply it to test_tweet_matrix (only this last shape is displayed by the cell)
test_tfidf=tfidf_transformer.transform(test_tweet_matrix)
test_tfidf.shape

(1522, 19280)

In [33]:
#Fit a Multinomial Naive Bayes classifier (default alpha=1.0, i.e. Laplace smoothing)
from sklearn.naive_bayes import MultinomialNB as MB
classifier_mb=MB()
#Train the model on the training TF-IDF matrix
classifier_mb.fit(train_tfidf,tweet_train.target)
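
The vectorizer and TF-IDF transformer in this notebook are fitted on the full data set. One common alternative is to chain the three steps in a scikit-learn `Pipeline` and fit it on the training texts only, so that nothing is learned from the test tweets. The sketch below illustrates the idea on a handful of made-up sentences; the texts, labels, and step names (`bow`, `tfidf`, `nb`) are assumptions for the example, not part of the original analysis.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training tweets: 1 = real disaster, 0 = not a real disaster
train_texts = [
    "forest fire spreading near town",
    "earthquake damage reported downtown",
    "flood waters rising evacuate",
    "this song is fire love it",
    "my exam was a total disaster lol",
    "flooded with emails this morning",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["wildfire evacuation ordered near town"]

model = Pipeline([
    ("bow", CountVectorizer()),     # word counts, vocabulary from training data only
    ("tfidf", TfidfTransformer()),  # IDF weights from training data only
    ("nb", MultinomialNB()),        # alpha=1.0 by default
])
model.fit(train_texts, train_labels)

pred = model.predict(test_texts)
vocabulary = model.named_steps["bow"].vocabulary_
# Words that occur only in the test text never enter the vocabulary
print(pred, "wildfire" in vocabulary)
```

With this layout, `model.predict` applies the same fitted transformations to new text, which also makes the model easier to reuse on unseen tweets.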

In [35]:
# Evaluate the model with test data
test_pred_m=classifier_mb.predict(test_tfidf)
#Accuracy 
accuracy_test_m=np.mean(test_pred_m==tweet_test.target)
accuracy_test_m


0.8022339027595269

In [37]:
#Confusion matrix: rows are the predicted labels, columns are the true labels
pd.crosstab(test_pred_m,tweet_test.target)

row_0 (predicted) \ target (actual),0,1
0,804,242
1,59,417
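
Accuracy alone does not address the stated objective of maximizing recall for real disaster tweets. As a rough check, the counts in the test confusion matrix above can be turned back into label arrays and passed to scikit-learn's `recall_score` and `precision_score`. The variable names below are introduced for this sketch; the four counts are copied from the table.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Counts from the test confusion matrix (rows = predicted, columns = actual)
tn, fn = 804, 242   # predicted 0: actual 0, actual 1
fp, tp = 59, 417    # predicted 1: actual 0, actual 1

# Rebuild label arrays that reproduce exactly these counts
y_true = np.array([0] * tn + [1] * fn + [0] * fp + [1] * tp)
y_pred = np.array([0] * tn + [0] * fn + [1] * fp + [1] * tp)

recall = recall_score(y_true, y_pred)        # tp / (tp + fn) = 417 / 659
precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 417 / 476
print(round(recall, 3), round(precision, 3))
```

On this split the model finds about 63% of the real disaster tweets while about 88% of its "real" predictions are correct, so recall is the weaker of the two metrics. In the notebook itself, the same functions can be called directly as `recall_score(tweet_test.target, test_pred_m)`.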


In [39]:
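#Evaluate the model on the training data
train_pred_m=classifier_mb.predict(train_tfidf)
accuracy_train_m=np.mean(train_pred_m==tweet_train.target)
#Not displayed, because only the last expression of a cell is shown;
#from the confusion matrix it is (3411 + 2051) / 6088, about 0.897
accuracy_train_m
#Confusion matrix: rows are the predicted labels, columns are the true labels
pd.crosstab(train_pred_m,tweet_train.target)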

row_0 (predicted) \ target (actual),0,1
0,3411,561
1,65,2051


In [43]:
#Multinomial Naive Bayes with a smaller smoothing parameter
#(alpha=1, the default used above, is Laplace smoothing; alpha<1 is Lidstone smoothing)
classifier_mb_lap=MB(alpha=0.25)
classifier_mb_lap.fit(train_tfidf,tweet_train.target)
#Evaluate on the test data
test_pred_lap=classifier_mb_lap.predict(test_tfidf)
accuracy_test_lap=np.mean(test_pred_lap==tweet_test.target)
accuracy_test_lap

#Confusion matrix: rows are the predicted labels, columns are the true labels
pd.crosstab(test_pred_lap,tweet_test.target)

row_0 (predicted) \ target (actual),0,1
0,757,197
1,106,462
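
To relate the two models to the business objective, the two test confusion matrices in this notebook can be compared side by side. The helper `summarize` below is a hypothetical name introduced for this sketch; the counts are copied from the two tables (default `alpha=1.0` and `alpha=0.25`).

```python
def summarize(tn, fn, fp, tp):
    """Accuracy plus recall and precision for the 'real' class (label 1)."""
    total = tn + fn + fp + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall_real": tp / (tp + fn),
        "precision_real": tp / (tp + fp),
    }

# Test-set counts taken from the two confusion matrices in this notebook
alpha_1 = summarize(tn=804, fn=242, fp=59, tp=417)     # MB() with default alpha=1.0
alpha_025 = summarize(tn=757, fn=197, fp=106, tp=462)  # MB(alpha=0.25)

for name, metrics in [("alpha=1.0", alpha_1), ("alpha=0.25", alpha_025)]:
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

On this particular split, lowering alpha leaves accuracy almost unchanged (about 0.802 versus 0.801) but raises recall for real tweets from roughly 0.63 to 0.70, at the cost of precision falling from roughly 0.88 to 0.81. Whether that trade is worthwhile depends on how costly a missed disaster tweet is compared with a false alarm.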
