![](http://ushuaianoticias.com/wp-content/uploads/2019/10/twitter.jpg)

## About the competition
Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

But it’s not always clear whether a person’s words are actually announcing a disaster.

## <font color='blue'>In this notebook we will explore the given data in depth and build a basic model.</font>


##### References : This notebook is based on [this](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc) notebook. The readability features and topic modelling were taken from this [notebook](https://www.kaggle.com/thebrownviking20/analyzing-quora-for-the-insinceres)<br>
##### Thanks to the author of the kernels [SRK](https://www.kaggle.com/sudalairajkumar) and [Siddharth Yadav](https://www.kaggle.com/thebrownviking20) <br>
##### Other references and credits : [https://monkeylearn.com/blog/introduction-to-topic-modeling/](https://monkeylearn.com/blog/introduction-to-topic-modeling/) <br>

### <font color='red'>If you find this kernel useful, please consider upvoting it 😊. It keeps me motivated to work hard and produce more quality content.</font>

# Table of contents: <br>
1. Importing necessary modules <br>
2. Importing Dataframes <br>
3. Target value distribution <br>
4. Exploring Location column <br>
   4.1   Number of tweets according to location(top 20) <br>
   4.2   Number of tweets according to location per class (0 or 1) <br>
   4.3   Visualising number of tweets per location on maps <br>
5. Word clouds of each class <br>
6. Word Frequency plots per each class(0 or 1) <br>
   6.1   Word Frequency <br>
   6.2   Bigram Plots <br>
   6.3   Trigram Plots <br>
7. Creating Meta Features <br>
   7.1   Plotting of meta features vs each target class (0 or 1) <br>
8. Histogram Plots <br>
   8.1   Histogram Plots of number of words per each class (0 or 1) <br>
   8.2   Histogram Plots of number of characters per each class (0 or 1) <br>
   8.3   Histogram Plots of number of punctuations per each class (0 or 1) <br>
   8.4   Histogram plots of number of words in train and test sets <br>
   8.5   Histogram plots of number of characters in train and test sets <br>
   8.6   Histogram plots of number of punctuations in train and test sets <br>
9. Readability features <br>
   9.1   The Flesch Reading Ease formula <br>
   9.2   The Flesch-Kincaid Grade Level <br>
   9.3   The Fog Scale (Gunning FOG Formula) <br>
   9.4   Automated Readability Index <br>
   9.5   The Coleman-Liau Index <br>
   9.6   Linsear Write Formula <br>
   9.7   Dale-Chall Readability Score <br>
   9.8  Readability Consensus based upon all the above tests <br>
10. Topic Modelling <br>
    10.1  Latent Dirichlet Allocation(LDA) <br>
    10.2  Printing keywords <br>
    10.3  Visualizing LDA results of not a real disaster tweets with pyLDAvis <br>
    10.4  Visualizing LDA results of a real disaster tweets with pyLDAvis <br>
11.  Tfidf Vectorization <br>
12.  Building basic Logistic regression model<br>
13.  Threshold value search for better score <br>
14.  Generating random tweets using Markov Chains <br>
     14.1  Building text model <br>
     14.2  Generating tweets about not a real disaster <br>
     14.3  Generating tweets about real disaster<br>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

#### Installing textstat package.

In [None]:
!pip install textstat

# 1. Importing necessary modules

In [None]:
import string
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()

%matplotlib inline

from plotly import tools
import plotly.offline as py
import plotly.figure_factory as ff
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

from sklearn import model_selection, preprocessing, metrics, ensemble, naive_bayes, linear_model
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from statistics import *
import concurrent.futures
import time
import pyLDAvis.sklearn
from pylab import bone, pcolor, colorbar, plot, show, rcParams, savefig
import textstat

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 999

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
import folium 
from folium import plugins 


#utility functions:
def plot_readability(a,b,title,bins=0.1,colors=['#3A4750', '#F64E8B']):
    # a holds the "not real" scores and b the "real" scores, so the labels must follow that order
    trace1 = ff.create_distplot([a,b], ["Not real disaster tweets","Real disaster tweets"], bin_size=bins, colors=colors, show_rug=False)
    trace1['layout'].update(title=title)
    py.iplot(trace1, filename='Distplot')
    table_data= [["Statistical Measures"," Not real disaster tweets","real disaster tweets"],
                ["Mean",mean(a),mean(b)],
                ["Standard Deviation",pstdev(a),pstdev(b)],
                ["Variance",pvariance(a),pvariance(b)],
                ["Median",median(a),median(b)],
                ["Maximum value",max(a),max(b)],
                ["Minimum value",min(a),min(b)]]
    trace2 = ff.create_table(table_data)
    py.iplot(trace2, filename='Table')

punctuations = string.punctuation
stopwords = list(STOP_WORDS)

parser = English()
def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    mytokens = " ".join([i for i in mytokens])
    return mytokens

import re
def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

def removeurl(raw_text):
    # match URLs anywhere in the text, not only at the start of a line
    clean_text = re.sub(r'https?://\S+', '', raw_text)
    return clean_text

# 2. Importing Dataframes

In [None]:
train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
sub = pd.read_csv('/kaggle/input/nlp-getting-started/sample_submission.csv')

In [None]:
#glimpse at train dataset
train.head()

## Columns
* id - a unique identifier for each tweet
* text - the text of the tweet
* location - the location the tweet was sent from (may be blank)
* keyword - a particular keyword from the tweet (may be blank)
* target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

In [None]:
#glimpse at test dataset
test.head()

In [None]:
#some basic cleaning
train['text'] = train['text'].apply(lambda x:cleanhtml(x))
test['text'] = test['text'].apply(lambda x:cleanhtml(x))

#removing url tags
train['text'] = train['text'].apply(lambda x:removeurl(x))
test['text'] = test['text'].apply(lambda x:removeurl(x))


# 3. Target Value Distribution

In [None]:
cnt_srs = train['target'].value_counts()
trace = go.Bar(
    x=cnt_srs.index,
    y=cnt_srs.values,
    marker=dict(
        color=cnt_srs.values,
        colorscale = 'Jet',
        reversescale = True
    ),
)

layout = go.Layout(
    title='Target Count',
    font=dict(size=18)
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename="TargetCount")

## target distribution ##
labels = (np.array(cnt_srs.index))
sizes = (np.array((cnt_srs / cnt_srs.sum())*100))

trace = go.Pie(labels=labels, values=sizes)
layout = go.Layout(
    title='Target distribution',
    font=dict(size=18),
    width=600,
    height=600,
)
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename="usertype")

*  We can observe that about 43% of the tweets in the dataframe are about real disasters.

# 4. Exploring location column

## 4.1 Number of tweets according to location(top 20)

In [None]:
# top 20 locations by number of tweets
cnt_ = train['location'].value_counts().head(20)
trace1 = go.Bar(
                x = cnt_.index,
                y = cnt_.values,
                name = "Number of tweets in dataset according to location",
                marker = dict(color = 'rgba(200, 74, 55, 0.5)',
                             line=dict(color='rgb(0,0,0)',width=1.5)),
                )

data = [trace1]
layout = go.Layout(barmode = "group",title = 'Number of tweets in dataset according to location')
fig = go.Figure(data = data, layout = layout)
py.iplot(fig)

*  Most of the tweets are from the USA, London and Canada.

## 4.2 Number of tweets according to location per class (0 or 1)

In [None]:
train1_df = train[train["target"]==1]
train0_df = train[train["target"]==0]
# top 20 locations for each class
cnt_1 = train1_df['location'].value_counts().head(20)

cnt_0 = train0_df['location'].value_counts().head(20)

trace1 = go.Bar(
                x = cnt_1.index,
                y = cnt_1.values,
                name = "Number of tweets about real disaster location wise",
                marker = dict(color = 'rgba(255, 74, 55, 0.5)',
                             line=dict(color='rgb(0,0,0)',width=1.5)),
                )
trace0 = go.Bar(
                x = cnt_0.index,
                y = cnt_0.values,
                name = "Number of tweets other than real disaster location wise",
                marker = dict(color = 'rgba(79, 82, 97, 0.5)',
                             line=dict(color='rgb(0,0,0)',width=1.5)),
                )


data = [trace0,trace1]
layout = go.Layout(barmode = 'stack',title = 'Number of tweets in dataset according to location')
fig = go.Figure(data = data, layout = layout)
py.iplot(fig)

* The graph says it all!!

## 4.3 Visualising number of tweets per location on maps

### Getting latitude and longitudes of locations for plotting.

In [None]:
df = train['location'].value_counts().head(20)
df = pd.DataFrame(df)
df = df.reset_index()
df.columns = ['location', 'counts'] 
geolocator = Nominatim(user_agent="specify_your_app_name_here")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
dictt_latitude = {}
dictt_longitude = {}
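for i in df['location'].values:
    print(i)
    location = geocode(i)
    # geocode returns None when a free-text location cannot be resolved
    if location is None:
        continue
    dictt_latitude[i] = location.latitude
    dictt_longitude[i] = location.longitude
df['latitude']= df['location'].map(dictt_latitude)
df['longitude'] = df['location'].map(dictt_longitude)
# drop locations that could not be geocoded so they are not plotted
df = df.dropna(subset=['latitude', 'longitude'])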

In [None]:
map1 = folium.Map(location=[10.0, 10.0], tiles='CartoDB dark_matter', zoom_start=2.3)
for i, row in df.iterrows():
    # scale the marker radius by the number of tweets at this location
    count = row['counts']*0.4
    folium.CircleMarker([float(row['latitude']), float(row['longitude'])], radius=float(count), color='#ef4f61', fill=True).add_to(map1)
map1

* We can observe that most of the tweets are from the America region, while a few are from Australia, India, Africa, etc.

* NOTE: Adjust zoom for better visualisation. 

# 5. Word clouds of each class

In [None]:
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Thanks : https://www.kaggle.com/aashita/word-clouds-of-various-shapes ##
def plot_wordcloud(text, mask=None, max_words=200, max_font_size=100, figure_size=(24.0,16.0), 
                   title = None, title_size=40, image_color=False):
    stopwords = set(STOPWORDS)
    more_stopwords = {'one', 'br', 'Po', 'th', 'sayi', 'fo', 'Unknown'}
    stopwords = stopwords.union(more_stopwords)

    wordcloud = WordCloud(background_color='black',
                    stopwords = stopwords,
                    max_words = max_words,
                    max_font_size = max_font_size, 
                    random_state = 42,
                    width=800, 
                    height=400,
                    mask = mask)
    wordcloud.generate(str(text))
    
    plt.figure(figsize=figure_size)
    if image_color:
        image_colors = ImageColorGenerator(mask);
        plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear");
        plt.title(title, fontdict={'size': title_size,  
                                  'verticalalignment': 'bottom'})
    else:
        plt.imshow(wordcloud);
        plt.title(title, fontdict={'size': title_size, 'color': 'black', 
                                  'verticalalignment': 'bottom'})
    plt.axis('off');
    plt.tight_layout()  
    
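# pass the joined tweet texts; str() of a DataFrame only contains its truncated printout
plot_wordcloud(" ".join(train.loc[train["target"]==1, "text"]), title="Word Cloud of tweets if real disaster")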

*  In the above word cloud we can see words like earthquake, fire and wildfires, which refer to real disasters.

In [None]:
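# pass the joined tweet texts; str() of a DataFrame only contains its truncated printout
plot_wordcloud(" ".join(train.loc[train["target"]==0, "text"]), title="Word Cloud of tweets if not a real disaster")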

*  Here we can observe words like disney and Wrecked; we need to explore more.

# 6. Word Frequency plots per each class(0 or 1)

## 6.1 Word Frequency 

In [None]:
from collections import defaultdict
train1_df = train[train["target"]==1]
train0_df = train[train["target"]==0]

## custom function for ngram generation ##
def generate_ngrams(text, n_gram=1):
    token = [token for token in text.lower().split(" ") if token != "" if token not in STOPWORDS]
    ngrams = zip(*[token[i:] for i in range(n_gram)])
    return [" ".join(ngram) for ngram in ngrams]

## custom function for horizontal bar chart ##
def horizontal_bar_chart(df, color):
    trace = go.Bar(
        y=df["word"].values[::-1],
        x=df["wordcount"].values[::-1],
        showlegend=False,
        orientation = 'h',
        marker=dict(
            color=color,
        ),
    )
    return trace

## Get the bar chart from tweets that are not about a real disaster ##
freq_dict = defaultdict(int)
for sent in train0_df["text"]:
    for word in generate_ngrams(sent):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace0 = horizontal_bar_chart(fd_sorted.head(50), 'red')

## Get the bar chart from tweets about a real disaster ##
freq_dict = defaultdict(int)
for sent in train1_df["text"]:
    for word in generate_ngrams(sent):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace1 = horizontal_bar_chart(fd_sorted.head(50), 'red')

# Creating two subplots
fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04,
                          subplot_titles=["Frequent words if tweet is not real disaster", 
                                          "Frequent words if tweet is real disaster"])
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig['layout'].update(height=1200, width=900, paper_bgcolor='rgb(233,233,233)', title="Word Count Plots")
py.iplot(fig, filename='word-plots')



## 6.2 Bigram plots
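
As a minimal sketch of how the n-grams counted in this section are formed, the zip-based sliding window used by `generate_ngrams` can be tried on a toy sentence. The helper name `toy_ngrams` and the sample sentence are illustrative assumptions, and stop-word filtering is left out for brevity.

```python
def toy_ngrams(text, n=2):
    """Return the word n-grams of `text` (hypothetical helper, no stop-word filtering)."""
    tokens = [t for t in text.lower().split(" ") if t != ""]
    # zip over n shifted copies of the token list to get sliding windows of size n
    windows = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(w) for w in windows]

sample = "Forest fire near La Ronge"
bigrams = toy_ngrams(sample, 2)
trigrams = toy_ngrams(sample, 3)
print(bigrams)
print(trigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, so very short tweets contribute few or no trigrams.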

In [None]:
freq_dict = defaultdict(int)
for sent in train0_df["text"]:
    for word in generate_ngrams(sent,2):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace0 = horizontal_bar_chart(fd_sorted.head(50), 'green')


freq_dict = defaultdict(int)
for sent in train1_df["text"]:
    for word in generate_ngrams(sent,2):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace1 = horizontal_bar_chart(fd_sorted.head(50), 'green')

# Creating two subplots
fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04,horizontal_spacing=0.15,
                          subplot_titles=["Frequent bigrams if tweet is not real disaster", 
                                          "Frequent bigrams if tweet is real disaster"])
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig['layout'].update(height=1200, width=900, paper_bgcolor='rgb(233,233,233)', title="Bigram Count Plots")
py.iplot(fig, filename='word-plots')

## 6.3 Trigram plots

In [None]:
freq_dict = defaultdict(int)
for sent in train0_df["text"]:
    for word in generate_ngrams(sent,3):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace0 = horizontal_bar_chart(fd_sorted.head(50), 'orange')


freq_dict = defaultdict(int)
for sent in train1_df["text"]:
    for word in generate_ngrams(sent,3):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace1 = horizontal_bar_chart(fd_sorted.head(50), 'orange')

# Creating two subplots
fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04, horizontal_spacing=0.2,
                          subplot_titles=["Frequent trigrams if tweet is not real disaster", 
                                          "Frequent trigrams if tweet is real disaster"])
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig['layout'].update(height=1200, width=1200, paper_bgcolor='rgb(233,233,233)', title="Trigram Count Plots")
py.iplot(fig, filename='word-plots')

# 7. Creating Meta Features:

Now we will create some meta features and then look at how they are distributed between the classes. The ones that we will create are

1. Number of words in the text <br>
2. Number of unique words in the text <br>
3. Number of characters in the text<br>
4. Number of stopwords<br>
5. Number of punctuations<br>
6. Number of upper case words<br>
7. Number of title case words<br>
8. Average length of the words<br>
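
To make the eight features concrete, here is a small sketch that computes them for a single string using only the standard library. The function name `meta_features`, the tiny `TOY_STOPWORDS` set and the sample tweet are assumptions made for illustration; the notebook itself uses the `STOPWORDS` set from `wordcloud` and applies the same logic column-wise with pandas.

```python
import string

# Small illustrative stop-word set (an assumption, not the wordcloud STOPWORDS list)
TOY_STOPWORDS = {"the", "is", "in", "a", "of", "and", "to"}

def meta_features(text):
    """Compute the eight meta features for one piece of text."""
    words = text.split()
    return {
        "num_words": len(words),
        "num_unique_words": len(set(words)),
        "num_chars": len(text),
        "num_stopwords": sum(w.lower() in TOY_STOPWORDS for w in words),
        "num_punctuations": sum(c in string.punctuation for c in text),
        "num_words_upper": sum(w.isupper() for w in words),
        "num_words_title": sum(w.istitle() for w in words),
        # guard against empty text to avoid dividing by zero
        "mean_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

features = meta_features("BREAKING: Fire in the city centre!")
print(features)
```

Note that words are split on whitespace only, so punctuation attached to a word (such as `BREAKING:`) counts towards that word's length.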

In [None]:
train["num_words"] = train["text"].apply(lambda x: len(str(x).split()))
test["num_words"] = test["text"].apply(lambda x: len(str(x).split()))

## Number of unique words in the text ##
train["num_unique_words"] = train["text"].apply(lambda x: len(set(str(x).split())))
test["num_unique_words"] = test["text"].apply(lambda x: len(set(str(x).split())))

## Number of characters in the text ##
train["num_chars"] = train["text"].apply(lambda x: len(str(x)))
test["num_chars"] = test["text"].apply(lambda x: len(str(x)))

## Number of stopwords in the text ##
train["num_stopwords"] = train["text"].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS]))
test["num_stopwords"] = test["text"].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS]))

## Number of punctuations in the text ##
train["num_punctuations"] =train['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
test["num_punctuations"] =test['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )

## Number of upper case words in the text ##
train["num_words_upper"] = train["text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
test["num_words_upper"] = test["text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))

## Number of title case words in the text ##
train["num_words_title"] = train["text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
test["num_words_title"] = test["text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))

## Average length of the words in the text ##
train["mean_word_len"] = train["text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
test["mean_word_len"] = test["text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))

## 7.1 Plotting of meta features vs each target class (0 or 1)

In [None]:
train.loc[train['num_words']>60, 'num_words'] = 60 #truncation for better visuals
train.loc[train['num_punctuations']>25, 'num_punctuations'] = 25 #truncation for better visuals
train.loc[train['num_chars']>350, 'num_chars'] = 350 #truncation for better visuals

f, axes = plt.subplots(3, 1, figsize=(10,20))
sns.boxplot(x='target', y='num_words', data=train, ax=axes[0])
axes[0].set_xlabel('Target', fontsize=12)
axes[0].set_title("Number of words in each class", fontsize=15)

sns.boxplot(x='target', y='num_chars', data=train, ax=axes[1])
axes[1].set_xlabel('Target', fontsize=12)
axes[1].set_title("Number of characters in each class", fontsize=15)

sns.boxplot(x='target', y='num_punctuations', data=train, ax=axes[2])
axes[2].set_xlabel('Target', fontsize=12)
axes[2].set_title("Number of punctuations in each class", fontsize=15)
plt.show()

# 8. Histogram plots

## 8.1 Histogram Plots of number of words per each class (0 or 1)

In [None]:
train1_df = train[train["target"]==1]
train0_df = train[train["target"]==0]

fig = go.Figure()
fig.add_trace(go.Histogram(x=train1_df['num_words'],name = 'Number of words in tweets about real disaster'))
fig.add_trace(go.Histogram(x=train0_df['num_words'],name = 'Number of words in tweets other than real disaster'))

# Stack both histograms
fig.update_layout(barmode='stack')
# Slightly reduce opacity
fig.update_traces(opacity=0.75)
fig.show()

## 8.2 Histogram Plots of number of characters per each class (0 or 1)

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=train1_df['num_chars'],name = 'Number of chars in tweets about real disaster',marker = dict(color = 'rgba(200, 100, 0, 0.8)')))
fig.add_trace(go.Histogram(x=train0_df['num_chars'],name = 'Number of chars in tweets other than real disaster',marker = dict(color = 'rgba(25, 133, 120, 0.8)')))

# Stack both histograms
fig.update_layout(barmode='stack')
# Slightly reduce opacity
fig.update_traces(opacity=0.75)
fig.show()

## 8.3 Histogram Plots of number of punctuations per each class (0 or 1)

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=train1_df['num_punctuations'],name = 'Number of punctuations in tweets about real disaster',marker = dict(color = 'rgba(97, 175, 222, 0.8)')))
fig.add_trace(go.Histogram(x=train0_df['num_punctuations'],name = 'Number of punctuations in tweets other than real disaster',marker = dict(color = 'rgba(200, 10, 150, 0.8)')))

# Stack both histograms
fig.update_layout(barmode='stack')
# Keep the bars fully opaque
fig.update_traces(opacity=1)
fig.show()

## 8.4 Histogram plots of number of words in train and test sets

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=train['num_words'],name = 'Number of words in training tweets',marker = dict(color = 'rgba(255, 0, 0, 0.8)')))
fig.add_trace(go.Histogram(x=test['num_words'],name = 'Number of words in testing tweets ',marker = dict(color = 'rgba(0, 187, 187, 0.8)')))

# Stack both histograms
fig.update_layout(barmode='stack')
# Slightly reduce opacity
fig.update_traces(opacity=0.75)
fig.show()

## 8.5 Histogram plots of number of chars in train and test sets

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=train['num_chars'],name = 'Number of chars in training tweets',marker = dict(color = 'rgba(25, 13, 8, 0.8)')))
fig.add_trace(go.Histogram(x=test['num_chars'],name = 'Number of chars in testing tweets ',marker = dict(color = 'rgba(8, 25, 187, 0.8)')))

# Stack both histograms
fig.update_layout(barmode='stack')
# Slightly reduce opacity
fig.update_traces(opacity=0.75)
fig.show()

## 8.6 Histogram plots of number of punctuations in train and test sets

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=train['num_punctuations'],name = 'Number of punctuations in training tweets',marker = dict(color = 'rgba(222, 111, 33, 0.8)')))
fig.add_trace(go.Histogram(x=test['num_punctuations'],name = 'Number of punctuations in testing tweets ',marker = dict(color = 'rgba(33, 111, 222, 0.8)')))

# Stack both histograms
fig.update_layout(barmode='stack')
# Slightly reduce opacity
fig.update_traces(opacity=0.75)
fig.show()

# 9. Readability features 

### Readability is the ease with which a reader can understand a written text. In natural language processing, the readability of text depends on its content. It focuses on the words we choose, and how we put them into sentences and paragraphs for the readers to comprehend.

## 9.1 The Flesch Reading Ease formula <br>
* In the Flesch reading-ease test, higher scores indicate material that is easier to read; lower numbers mark passages that are more difficult to read. The formula for the Flesch reading-ease score (FRES) test is
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/bd4916e193d2f96fa3b74ee258aaa6fe242e110e)

## Score - Difficulty <br>
* 90-100 - Very Easy
* 80-89 - Easy
* 70-79 - Fairly Easy
* 60-69 - Standard
* 50-59 - Fairly Difficult
* 30-49 - Difficult
* 0-29 - Very Confusing

Read More : [Wikipedia](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests)
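
As a rough worked example of the formula, the score can be computed by hand from word, sentence and syllable counts. The function name and the counts below are made up for illustration; `textstat` derives the counts from the text itself, so its result for a real tweet will depend on how it tokenises and counts syllables.

```python
def flesch_reading_ease_from_counts(total_words, total_sentences, total_syllables):
    """Flesch reading-ease score from raw counts (illustrative helper)."""
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

# Hypothetical passage: 100 words, 5 sentences, 140 syllables
fres = flesch_reading_ease_from_counts(100, 5, 140)
print(round(fres, 3))
```

With these counts the score falls in the 60-69 band, which the table above labels "Standard".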

In [None]:
tqdm.pandas()
fre_notreal = np.array(train["text"][train["target"] == 0].progress_apply(textstat.flesch_reading_ease))
fre_real = np.array(train["text"][train["target"] == 1].progress_apply(textstat.flesch_reading_ease))
plot_readability(fre_notreal,fre_real,"Flesch Reading Ease",20)

## 9.2 The Flesch-Kincaid Grade Level <br>
* These readability tests are used extensively in the field of education. The "Flesch–Kincaid Grade Level Formula" instead presents a score as a U.S. grade level, making it easier for teachers, parents, librarians, and others to judge the readability level of various books and texts. It can also mean the number of years of education generally required to understand this text, relevant when the formula results in a number greater than 10. The grade level is calculated with the following formula: <br>
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/8e68f5fc959d052d1123b85758065afecc4150c3) <br>

* A score of 9.3 means that a ninth grader would be able to read the document. <br>
* Read more: [Wikipedia](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests)
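
The grade-level formula uses the same two ratios as the reading-ease score, only with different weights. A small sketch with hypothetical counts (the helper name is an assumption, not part of `textstat`):

```python
def flesch_kincaid_grade_from_counts(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level from raw counts (illustrative helper)."""
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)

# Hypothetical passage: 100 words, 5 sentences, 140 syllables
fkg = flesch_kincaid_grade_from_counts(100, 5, 140)
print(round(fkg, 2))
```

Longer sentences and more syllables per word both push the grade level up, which is the opposite direction to the reading-ease score.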

In [None]:
fkg_notreal = np.array(train["text"][train["target"] == 0].progress_apply(textstat.flesch_kincaid_grade))
fkg_real = np.array(train["text"][train["target"] == 1].progress_apply(textstat.flesch_kincaid_grade))
plot_readability(fkg_notreal,fkg_real,"Flesch Kincaid Grade",4,['#C1D37F','#491F21'])

## 9.3 The Fog Scale (Gunning FOG Formula) <br>
* In linguistics, the Gunning fog index is a readability test for English writing. The index estimates the years of formal education a person needs to understand the text on the first reading. For instance, a fog index of 12 requires the reading level of a United States high school senior (around 18 years old). <br>
* The formula to calculate Fog scale: <br>
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/84cd504cf61d43230ef59fbd0ecf201796e5e577)
* Read more : [Wikipedia](https://en.wikipedia.org/wiki/Gunning_fog_index)
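
To see how the two terms of the fog index combine, the sketch below evaluates it from counts. The counts are invented, and "complex words" is taken here to mean words of three or more syllables, as in the linked article.

```python
def gunning_fog_from_counts(total_words, total_sentences, complex_words):
    """Gunning fog index from raw counts (illustrative helper)."""
    avg_sentence_length = total_words / total_sentences
    pct_complex = 100 * (complex_words / total_words)
    return 0.4 * (avg_sentence_length + pct_complex)

# Hypothetical passage: 100 words, 5 sentences, 10 complex words
fog = gunning_fog_from_counts(100, 5, 10)
print(fog)
```

These counts give an index of about 12, the high-school-senior level mentioned above.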

In [None]:
fog_notreal = np.array(train["text"][train["target"] == 0].progress_apply(textstat.gunning_fog))
fog_real = np.array(train["text"][train["target"] == 1].progress_apply(textstat.gunning_fog))
plot_readability(fog_notreal,fog_real,"The Fog Scale (Gunning FOG Formula)",4,['#E2D58B','#CDE77F'])

## 9.4 Automated Readability Index <br>
* Returns the ARI (Automated Readability Index), which outputs a number that approximates the grade level needed to comprehend the text. For example, if the ARI is 6.5, then the grade level needed to comprehend the text is 6th to 7th grade.
* Formula to calculate ARI: <br>
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/878d1640d23781351133cad73bdf27bdf8bfe2fd) <br>
* Read More: [Wikipedia](https://en.wikipedia.org/wiki/Automated_readability_index)
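
Unlike the Flesch tests, ARI relies on characters per word rather than syllables, which makes it easy to compute by hand. A sketch with hypothetical counts (the helper name is an assumption):

```python
def ari_from_counts(total_chars, total_words, total_sentences):
    """Automated Readability Index from raw counts (illustrative helper)."""
    return (4.71 * (total_chars / total_words)
            + 0.5 * (total_words / total_sentences)
            - 21.43)

# Hypothetical passage: 450 characters, 100 words, 5 sentences
ari = ari_from_counts(450, 100, 5)
print(round(ari, 3))
```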

In [None]:
ari_notreal = np.array(train["text"][train["target"] == 0].progress_apply(textstat.automated_readability_index))
ari_real = np.array(train["text"][train["target"] == 1].progress_apply(textstat.automated_readability_index))
plot_readability(ari_notreal,ari_real,"Automated Readability Index",10,['#488286','#FF934F'])

## 9.5 The Coleman-Liau Index <br>
* Returns the grade level of the text using the Coleman-Liau Formula. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document. <br>
* The Coleman–Liau index is calculated with the following formula: <br>
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/cae44bbb96eaaca26e6aaf3b65c342f69f3d49ce) <br>
L is the average number of letters per 100 words and S is the average number of sentences per 100 words. <br>
* Read More : [Wikipedia](https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index)
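
The definitions of L and S can be turned into a short calculation. The sketch below first normalises hypothetical letter and sentence counts to "per 100 words" and then applies the index; the function name and counts are illustrative only.

```python
def coleman_liau_from_counts(total_letters, total_words, total_sentences):
    """Coleman-Liau index from raw counts (illustrative helper)."""
    L = total_letters / total_words * 100    # average letters per 100 words
    S = total_sentences / total_words * 100  # average sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

# Hypothetical passage: 450 letters, 100 words, 5 sentences
cli = coleman_liau_from_counts(450, 100, 5)
print(round(cli, 2))
```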

In [None]:
cli_notreal = np.array(train["text"][train["target"] == 0].progress_apply(textstat.coleman_liau_index))
cli_real = np.array(train["text"][train["target"] == 1].progress_apply(textstat.coleman_liau_index))
plot_readability(cli_notreal,cli_real,"The Coleman-Liau Index",10,['#8491A3','#2B2D42'])

## 9.6 Linsear Write Formula <br>
* Returns the grade level of the text using the Linsear Write Formula. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document. <br>
* Read More : [Wikipedia](https://en.wikipedia.org/wiki/Linsear_Write)
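
No formula image is shown for this test, so here is a hedged sketch of the procedure as it is commonly described (for example on the linked page): in a 100-word sample, words of up to two syllables score 1 point, words of three or more syllables score 3 points, and the total is divided by the number of sentences and then adjusted. The helper name and counts are assumptions, and `textstat`'s implementation may differ in its details.

```python
def linsear_write_from_counts(easy_words, hard_words, sentences):
    """Linsear Write grade for a roughly 100-word sample (illustrative helper).

    easy_words: words with two syllables or fewer (1 point each)
    hard_words: words with three or more syllables (3 points each)
    """
    r = (easy_words * 1 + hard_words * 3) / sentences
    # provisional results above 20 are halved; otherwise halve and subtract 1
    return r / 2 if r > 20 else r / 2 - 1

# Hypothetical sample: 90 easy words, 10 hard words, 5 sentences
lwf = linsear_write_from_counts(90, 10, 5)
print(lwf)
```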

In [None]:
lwf_notreal = np.array(train["text"][train["target"] == 0].progress_apply(textstat.linsear_write_formula))
lwf_real = np.array(train["text"][train["target"] == 1].progress_apply(textstat.linsear_write_formula))
plot_readability(lwf_notreal,lwf_real,"Linsear Write Formula",2,['#8D99AE','#EF233C'])

## 9.7 Dale-Chall Readability Score <br>
* Different from other tests, since it uses a lookup table of the most commonly used 3000 English words. Thus it returns the grade level using the New Dale-Chall Formula. <br>
* The formula for calculating the raw score of the Dale–Chall readability score is given below: <br>
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/0541f1e629f0c06796c5a5babb3fac8d100a858c)

* Score - Understood by
    * 4.9 or lower - average 4th-grade student or lower
    * 5.0–5.9 - average 5th or 6th-grade student
    * 6.0–6.9 - average 7th or 8th-grade student
    * 7.0–7.9 - average 9th or 10th-grade student
    * 8.0–8.9 - average 11th or 12th-grade student
    * 9.0–9.9 - average 13th to 15th-grade (college) student

* Read More : [Wikipedia](https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula)
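
The raw score and its adjustment can be sketched as follows. "Difficult words" are those not found in the list of about 3000 familiar words; the counts and the helper name below are invented for illustration, and the 3.6365 adjustment is added when more than 5% of the words are difficult.

```python
def dale_chall_from_counts(total_words, total_sentences, difficult_words):
    """Dale-Chall readability score from raw counts (illustrative helper)."""
    pct_difficult = difficult_words / total_words * 100
    score = 0.1579 * pct_difficult + 0.0496 * (total_words / total_sentences)
    # adjustment applied when more than 5% of the words are difficult
    if pct_difficult > 5:
        score += 3.6365
    return score

# Hypothetical passage: 100 words, 5 sentences, 10 difficult words
dcr = dale_chall_from_counts(100, 5, 10)
print(round(dcr, 4))
```

With these counts the score lands in the 6.0–6.9 band of the table above.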

In [None]:
dcr_notreal = np.array(train["text"][train["target"] == 0].progress_apply(textstat.dale_chall_readability_score))
dcr_real = np.array(train["text"][train["target"] == 1].progress_apply(textstat.dale_chall_readability_score))
plot_readability(dcr_notreal,dcr_real,"Dale-Chall Readability Score",1,['#C65D17','#DDB967'])

## 9.8 Readability Consensus based upon all the above tests <br>
* Based upon all the above tests, returns the estimated school grade level required to understand the text. <br>

In [None]:
def consensus_all(text):
    return textstat.text_standard(text,float_output=True)

con_notreal = np.array(train["text"][train["target"] == 0].progress_apply(consensus_all))
con_real = np.array(train["text"][train["target"] == 1].progress_apply(consensus_all))
plot_readability(con_notreal,con_real,"Readability Consensus based upon all the above tests",2)

# 10. Topic Modelling <br>
* Topic modeling is a machine learning technique that automatically analyzes text data to determine cluster words for a set of documents. This is known as ‘unsupervised’ machine learning because it doesn’t require a predefined list of tags or training data that’s been previously classified by humans. <br>
* Topic modeling involves counting words and grouping similar word patterns to infer topics within unstructured data. <br>
* Read More : [Here](https://monkeylearn.com/blog/introduction-to-topic-modeling/) <br>
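
The word-counting step mentioned above can be sketched in plain Python. This is only an illustration on three made-up tweets; the variable names are hypothetical, and the notebook itself relies on `CountVectorizer` for this step.

```python
from collections import Counter

# Three made-up documents (not taken from the competition data)
docs = [
    "forest fire near the town",
    "fire crews fight the forest fire",
    "new album drops this friday",
]

# Shared vocabulary, sorted so that the column order is stable
vocab = sorted({word for doc in docs for word in doc.split()})

# One row of word counts per document: a small document-term matrix
doc_term = []
for doc in docs:
    counts = Counter(doc.split())
    doc_term.append([counts[word] for word in vocab])

for doc, row in zip(docs, doc_term):
    print(row, "<-", doc)
```

The first two rows share non-zero columns ("fire", "forest", "the"), while the third shares none with them. Overlap of this kind is what a topic model uses to group documents.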

In [None]:
notreal_text = train["text"][train["target"] == 0].progress_apply(spacy_tokenizer)
real_text = train["text"][train["target"] == 1].progress_apply(spacy_tokenizer)
# Count vectorization; the token pattern is a raw string so that "\-" is not treated as an invalid escape
vectorizer_notreal = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
notreal_vectorized = vectorizer_notreal.fit_transform(notreal_text)
vectorizer_real = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
real_vectorized = vectorizer_real.fit_transform(real_text)

## 10.1 Latent Dirichlet Allocation (LDA) <br>
* Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) are based on the same underlying assumptions: the distributional hypothesis (i.e. similar topics make use of similar words) and the statistical mixture hypothesis (i.e. documents talk about several topics), for which a statistical distribution can be determined. The purpose of LDA is to map each document in our corpus to a set of topics that covers a good deal of the words in the document. <br>

* To map the documents to a list of topics, LDA assigns topics to arrangements of words, e.g. n-grams such as "best player" for a topic related to sports. This stems from the assumption that documents are written with arrangements of words and that those arrangements determine topics. LDA ignores syntactic information and treats documents as bags of words. It also assumes that every word in a document can be assigned a probability of belonging to a topic. That said, the goal of LDA is to determine the mixture of topics that a document contains. <br>

* In other words, LDA assumes that topics and documents look like this: <br>
![](https://pcdn.piiojs.com/i/kqctmw/vw,1536,vh,0,r,1,pr,1.3,wp,1/https%3A%2F%2Flh4.googleusercontent.com%2FyW-fEumGuN1mRPbcUG3K0br-7pGgNcFTMhv1flKg1foshZ7fzbUD7hDxifs9seBLJcnEBBAo-sCeO3zjjhCSduETRtdSwIkk5gDosxV4Ijo98NtibeknHBrPD2bH_AHampsy2lGM) <br>
* And, when LDA models a new document, it works this way: <br>
![](https://pcdn.piiojs.com/i/kqctmw/vw,1536,vh,0,r,1,pr,1.3,wp,1/https%3A%2F%2Flh3.googleusercontent.com%2FiCqGC2Yuyjb0HqlQcaVmAioCg3k56PQJXs7-pinKZwCiy00tL6ObLnVdJy6JRrGWRtnkunP6v94RRoxRlJsdJk00ydPKWnmSxr-zns4WJz4k7IHIMfbnKFSXb0tfvs3_09Txebjo) <br>
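
The "mixture of topics" assumption can be sketched as a tiny generative process: pick a topic according to the document's mixture, then pick a word from that topic. The topics, probabilities and the `generate_document` helper below are all made up for illustration; this is not how `LatentDirichletAllocation` is implemented, only the story it assumes about how documents arise.

```python
import random

# Hypothetical topics: each topic is a probability distribution over words
topics = {
    "weather": {"storm": 0.5, "flood": 0.3, "wind": 0.2},
    "music": {"album": 0.5, "concert": 0.3, "song": 0.2},
}

def generate_document(topic_mixture, n_words, rng):
    """Sample words by first picking a topic, then a word from that topic."""
    words = []
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        word_dist = topics[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words

rng = random.Random(0)
# A document assumed to be 80% about weather and 20% about music
doc = generate_document({"weather": 0.8, "music": 0.2}, n_words=8, rng=rng)
print(doc)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the per-document topic mixtures and the per-topic word distributions.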




In [None]:
# Latent Dirichlet Allocation Model
lda_notreal = LatentDirichletAllocation(n_components=10, max_iter=5, learning_method='online',verbose=True)
notreal_lda = lda_notreal.fit_transform(notreal_vectorized)
lda_real = LatentDirichletAllocation(n_components=10, max_iter=5, learning_method='online',verbose=True)
real_lda = lda_real.fit_transform(real_vectorized)

## 10.2 Printing keywords

In [None]:
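def selected_topics(model, vectorizer, top_n=10):
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() when it exists
    get_names = getattr(vectorizer, "get_feature_names_out", None) or vectorizer.get_feature_names
    feature_names = get_names()  # fetched once instead of once per word
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # argsort()[:-top_n - 1:-1] yields the indices of the top_n largest weights, largest first
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-top_n - 1:-1]])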

In [None]:
print("Not real disaster tweets LDA Model:")
selected_topics(lda_notreal, vectorizer_notreal)

In [None]:
print("Real disaster tweets LDA Model:")
selected_topics(lda_real, vectorizer_real)

## 10.3 Visualizing LDA results of not a real disaster tweets with pyLDAvis

In [None]:
pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda_notreal, notreal_vectorized, vectorizer_notreal, mds='tsne')
dash

## 10.4 Visualizing LDA results of a real disaster tweets with pyLDAvis 

In [None]:
pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda_real, real_vectorized, vectorizer_real, mds='tsne')
dash

# 11. Tfidf Vectorization

In [None]:
tfidf_vec = TfidfVectorizer(stop_words='english', ngram_range=(1,3))
# Only the fitted vocabulary and IDF weights are needed here, so fit() is enough
tfidf_vec.fit(train['text'].values.tolist() + test['text'].values.tolist())
train_tfidf = tfidf_vec.transform(train['text'].values.tolist())
test_tfidf = tfidf_vec.transform(test['text'].values.tolist())

# 12. Building basic Logistic regression model

In [None]:
train_y = train["target"].values

def runModel(train_X, train_y, test_X, test_y, test_X2):
    model = linear_model.LogisticRegression(C=5., solver='sag')
    model.fit(train_X, train_y)
    pred_test_y = model.predict_proba(test_X)[:,1]
    pred_test_y2 = model.predict_proba(test_X2)[:,1]
    return pred_test_y, pred_test_y2, model

print("Building model.")
cv_scores = []
pred_full_test = 0
pred_train = np.zeros([train.shape[0]])
kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=2017)
for dev_index, val_index in kf.split(train):
    dev_X, val_X = train_tfidf[dev_index], train_tfidf[val_index]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    pred_val_y, pred_test_y, model = runModel(dev_X, dev_y, val_X, val_y, test_tfidf)
    pred_full_test = pred_full_test + pred_test_y
    pred_train[val_index] = pred_val_y
    cv_scores.append(metrics.log_loss(val_y, pred_val_y))
    break  # stop after the first fold, which serves as a single hold-out validation split

# 13. Threshold value search for better score

In [None]:
from tqdm import tqdm
def threshold_search(y_true, y_proba):
    # reference: https://www.kaggle.com/hung96ad/pytorch-starter
    best_threshold = 0
    best_score = 0
    for threshold in tqdm([i * 0.001 for i in range(1000)]):
        score = metrics.f1_score(y_true=y_true, y_pred=y_proba > threshold)
        if score > best_score:
            best_threshold = threshold
            best_score = score
    search_result = {'threshold': best_threshold, 'f1': best_score}
    return search_result
search_result = threshold_search(val_y, pred_val_y)
search_result
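
To see why the threshold matters, here is a small self-contained sketch that computes F1 by hand at a few cut-offs. The labels and probabilities are made up, and `f1_at_threshold` is a hypothetical helper rather than part of scikit-learn.

```python
def f1_at_threshold(y_true, y_proba, threshold):
    """F1 score when predicting 1 for every probability above the threshold."""
    y_pred = [int(p > threshold) for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels and predicted probabilities
y_true = [1, 1, 1, 0, 0, 0]
y_proba = [0.9, 0.45, 0.35, 0.4, 0.2, 0.1]

for threshold in (0.3, 0.42, 0.5):
    print(threshold, round(f1_at_threshold(y_true, y_proba, threshold), 3))
```

On this toy data the lowest cut-off gives the best F1, because it recovers every positive at the cost of a single false positive. The default of 0.5 is therefore not necessarily the best choice.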

In [None]:
# Use the threshold found by the search above instead of a hard-coded value
best_threshold = search_result['threshold']
val_pred_labels = (pred_val_y > best_threshold).astype(int)
print("F1 score at threshold {0} is {1}".format(best_threshold, metrics.f1_score(val_y, val_pred_labels)))
print("Precision at threshold {0} is {1}".format(best_threshold, metrics.precision_score(val_y, val_pred_labels)))
print("Recall at threshold {0} is {1}".format(best_threshold, metrics.recall_score(val_y, val_pred_labels)))

Now let us look at the important words used for classifying when target = 1.

In [None]:
import eli5
eli5.show_weights(model, vec=tfidf_vec, top=100, feature_filter=lambda x: x != '<BIAS>')

# 14. Generating random tweets using Markov Chains <br>
* This is just a fun part of this kernel. So far we have covered text analytics, topic modelling and basic model fitting. Now we will also look at generating text using Markov chains. <br>


## What are Markov chains? <br>
#### Markov chains, named after Andrey Markov, are mathematical systems that hop from one "state" (a situation or set of values) to another. For example, if you made a Markov chain model of a baby's behavior, you might include "playing", "eating", "sleeping" and "crying" as states, which together with other behaviors could form a "state space": a list of all possible states. In addition, on top of the state space, a Markov chain tells you the probability of hopping, or "transitioning", from one state to any other state, e.g. the chance that a baby currently playing will fall asleep in the next five minutes without crying first. <br>
#### Below is an example of a Markov chain: <br>
![](https://cdn-images-1.medium.com/max/800/1*MbHRwYNA8F29hzes8EPHiQ.gif) <br>
* Credits and references:
* [Reference 1](http://setosa.io/ev/markov-chains/)<br>
* [Reference 2](https://www.kaggle.com/nulldata/meaningful-random-headlines-by-markov-chain/notebook)
* [Reference 3](https://hackernoon.com/from-what-is-a-markov-model-to-here-is-how-markov-models-work-1ac5f4629b71) 
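
Applied to text, the "states" become short runs of words and the transitions are the words observed to follow them. The sketch below builds such a chain by hand on two made-up sentences. `build_chain` and `generate` are hypothetical helpers written for illustration; `markovify` handles more than this (for example sentence start and end markers), so treat it as a simplified picture rather than the library's implementation.

```python
import random
from collections import defaultdict

def build_chain(sentences, state_size=2):
    """Map each tuple of `state_size` consecutive words to the words seen after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain

def generate(chain, start, max_words, rng):
    """Walk the chain from `start`, picking a random recorded follower at each step."""
    words = list(start)
    while len(words) < max_words:
        followers = chain.get(tuple(words[-len(start):]))
        if not followers:
            break  # no recorded continuation for the current state
        words.append(rng.choice(followers))
    return " ".join(words)

# Made-up sentences (not taken from the competition data)
sentences = [
    "the fire spread to the hills",
    "the fire spread across the valley",
]
chain = build_chain(sentences)
print(generate(chain, ("the", "fire"), max_words=8, rng=random.Random(0)))
```

With a state of two words, the next word depends only on the previous two, which is the role `state_size = 2` plays in the `markovify` models built below.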


In [None]:
#loading markovify module
import markovify

In [None]:
#preparing dataset
data_notreal = train["text"][train["target"] == 0]
data_real = train["text"][train["target"] == 1]

## 14.1 Building text model

In [None]:
text_model_notreal = markovify.NewlineText(data_notreal, state_size = 2)
text_model_real = markovify.NewlineText(data_real, state_size = 2)

## 14.2 Generating tweets about not a real disaster

In [None]:
for i in range(10):
    print(text_model_notreal.make_sentence())

## 14.3 Generating tweets about real disaster

In [None]:
for i in range(10):
    print(text_model_real.make_sentence())

# That's all for now. <font color='red'>If you like this kernel, please consider upvoting it.</font> I welcome suggestions to improve this kernel further.<br>
## Thank you 😊, happy learning. <br>