> This kernel is based on the work of http://hunterheidenreich.com/blog/elmo-word-vectors-in-keras/

# 1. Kernel Overview

## 1.1 Definition

In today's world, **text classification/segmentation/categorization** (for example, ticket categorization in a call centre, email classification, or log category detection) is a common task. With the huge volume of data out there, it is nearly impossible to do this manually. Let's try to solve this problem automatically using machine learning and natural language processing tools.

## 1.2 Problem Statement

The BBC articles dataset (2126 records) consists of two features, the text and the associated category, namely:
1. Sport 
2. Business 
3. Politics 
4. Tech 
5. Others

**Our task is to train a multiclass classification model on the mentioned dataset.**

## 1.3 Metrics

**Accuracy** - the number of correct predictions as a fraction of all predictions made.

**Precision** - also called positive predictive value; the fraction of retrieved instances that are relevant.

**F1 score** - the harmonic mean of precision and recall, so it takes both into account.

**Recall** - also known as sensitivity; the fraction of all relevant instances that were retrieved.

**Why these metrics?** - We use accuracy, precision, recall and F1 score to evaluate the model. Accuracy estimates the overall share of correct predictions. Precision tells us how many of the articles predicted to be in a category actually belong to it, i.e. how relevant the model's results are. Recall tells us how many of the articles that actually belong to a category the model found, i.e. true positives relative to true positives plus false negatives. F1 score combines precision and recall into a single estimate.
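To make the definitions above concrete, here is a minimal pure-Python sketch that computes accuracy and per-class precision, recall and F1 from two label lists. The function name `per_class_metrics` and the toy labels are assumptions made for illustration; they are not part of the kernel, which relies on scikit-learn for the same numbers.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and per-class precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted this class, but it was wrong
            fn[truth] += 1  # missed an instance of the true class
    accuracy = sum(tp.values()) / len(y_true)
    report = {}
    for label in labels:
        predicted = tp[label] + fp[label]  # everything assigned to this class
        actual = tp[label] + fn[label]     # everything that truly is this class
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, report

# Toy labels, made up for illustration only.
y_true = ["sport", "sport", "tech", "tech", "business", "business"]
y_pred = ["sport", "tech", "tech", "tech", "business", "sport"]
accuracy, report = per_class_metrics(y_true, y_pred)
print(accuracy)
print(report)
```

In this toy example "tech" has perfect recall but lower precision, because one sport article was wrongly labelled as tech. A single accuracy figure would hide that kind of difference.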

## 1.4 Machine Learning Model Considered:

We will be using **ELMo embeddings with Keras** for this use case.

Explaining ELMo and Keras is outside the scope of this kernel. Kindly refer to external sources.

# 2. Data Exploration

### Step 2.1 Load Dataset

In [None]:
import pandas as pd

data=pd.read_csv(r"../input/bbc-text.csv")

In [None]:
data.head()

# 3. Implementation

### Step 2.2 Map Textual labels to numeric using Label Encoder

In [None]:
from sklearn.preprocessing import LabelEncoder
df2 = pd.DataFrame()
df2["text"] = data["text"]
df2["label"] = LabelEncoder().fit_transform(data["category"])
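As a rough illustration of what the label-encoding step does, the sketch below maps each distinct category to an integer in sorted order, which is how `LabelEncoder` assigns its codes. The helper `fit_label_mapping` and the sample categories are hypothetical. Keeping the mapping (or the fitted encoder) around is what lets predictions be turned back into category names later.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    return {name: index for index, name in enumerate(classes)}

# Hypothetical sample of category labels.
categories = ["tech", "sport", "business", "sport", "politics"]

mapping = fit_label_mapping(categories)
encoded = [mapping[name] for name in categories]

# The inverse mapping recovers the original names from the integer codes.
inverse = {index: name for name, index in mapping.items()}
decoded = [inverse[index] for index in encoded]

print(mapping)
print(encoded)
```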

In [None]:
import nltk
nltk.download('stopwords')

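from nltk.corpus import stopwords
# Use a set so that each membership test is constant time.
stop = set(stopwords.words('english'))
# Drop English stop words from every article.
df2['text'] = df2['text'].apply(lambda text: " ".join(word for word in text.split() if word not in stop))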
df2['text'].head()

In [None]:
df2.head()

In [None]:
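# The last 10 entries of value_counts() are the 10 least frequent words in the corpus.
freq = pd.Series(' '.join(df2['text']).split()).value_counts()[-10:]
rare_words = set(freq.index)
# Remove those rare words from every article.
df2['text'] = df2['text'].apply(lambda text: " ".join(word for word in text.split() if word not in rare_words))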
df2['text'].head()
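The two cleaning passes above (stop-word removal, then rare-word removal) can be sketched without pandas or NLTK. Everything here is an assumption for illustration: the mini-corpus, the tiny stop-word set and the helper `remove_words`. The sketch also treats any word that occurs only once as rare, whereas the kernel drops the 10 least frequent words.

```python
from collections import Counter

# Hypothetical mini-corpus and stop-word list, for illustration only.
docs = [
    "the match was won by the home team",
    "the market fell as the bank raised rates",
    "the team signed a striker from the bank of the river",
]
stop_words = {"the", "was", "by", "as", "a", "from", "of"}

def remove_words(text, unwanted):
    """Return the text with every whitespace-separated word in `unwanted` dropped."""
    return " ".join(word for word in text.split() if word not in unwanted)

# Pass 1: remove stop words.
cleaned = [remove_words(doc, stop_words) for doc in docs]

# Pass 2: count words over the whole corpus and remove those seen only once.
counts = Counter(word for doc in cleaned for word in doc.split())
rare = {word for word, count in counts.items() if count == 1}
filtered = [remove_words(doc, rare) for doc in cleaned]

print(cleaned)
print(filtered)
```

On a corpus this small the threshold removes almost everything, which shows how aggressive rare-word filtering can be. On the full dataset, dropping only 10 words is a very light filter.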

### Step 2.3 Import the Libraries

In [None]:
import pandas as pd
import numpy as np
import spacy
from tqdm import tqdm
import re
import time
import pickle
pd.set_option('display.max_colwidth', 200)

In [None]:
import tensorflow_hub as hub
import tensorflow as tf

# hub.Module is the TensorFlow 1.x Hub API, so this kernel assumes a TF 1.x runtime.
embed = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

### Step 2.4 Convert Sentence to Elmo Vectors

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
import pandas as pd
from sklearn import preprocessing
import keras
import numpy as np


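# The raw text is passed to the model as-is; the ELMo layer embeds it during training.
y = list(df2['label'])
x = list(df2['text'])

# The labels are already integers 0..4, so this encoder maps each one to itself.
# It gives encode() and decode() a single fitted object to share.
le = preprocessing.LabelEncoder()
le.fit(y)

def encode(le, labels):
    # Integer labels -> one-hot rows, as required by categorical_crossentropy.
    enc = le.transform(labels)
    return keras.utils.to_categorical(enc)

def decode(le, one_hot):
    # One-hot (or softmax) rows -> integer labels via the arg-max column.
    dec = np.argmax(one_hot, axis=1)
    return le.inverse_transform(dec)


x_enc = x
y_enc = encode(le, y)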
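The `encode`/`decode` pair is a one-hot round trip: integer labels become rows with a single 1, and the arg-max of each row recovers the label. The pure-Python sketch below shows the same idea on hypothetical values. `to_one_hot` and `from_one_hot` are illustrative stand-ins, not the Keras or NumPy functions used in the kernel.

```python
def to_one_hot(indices, num_classes):
    """Turn integer class indices into one-hot rows."""
    rows = []
    for index in indices:
        row = [0.0] * num_classes
        row[index] = 1.0
        rows.append(row)
    return rows

def from_one_hot(rows):
    """Return the index of the largest value in each row (arg-max)."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]

labels = [2, 0, 4, 1]
one_hot = to_one_hot(labels, num_classes=5)

# The same arg-max also decodes softmax-style probability rows.
probabilities = [
    [0.10, 0.20, 0.60, 0.05, 0.05],
    [0.70, 0.10, 0.10, 0.05, 0.05],
]

print(one_hot)
print(from_one_hot(one_hot))
print(from_one_hot(probabilities))
```

This is why a single `decode` function can be applied both to the one-hot test labels and to the model's predicted probabilities.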

### Step 2.5 Divide dataset to test and train dataset

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(np.asarray(x_enc), np.asarray(y_enc), test_size=0.2, random_state=42)

In [None]:
x_train.shape

### Step 2.6 Train Keras neural model with ELMo Embeddings

In [None]:
from keras.layers import Input, Lambda, Dense
from keras.models import Model
import keras.backend as K

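def ELMoEmbedding(x):
    # Squeeze only the feature axis so that a batch with a single example keeps its batch dimension.
    return embed(tf.squeeze(tf.cast(x, tf.string), axis=1), signature="default", as_dict=True)["default"]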

input_text = Input(shape=(1,), dtype=tf.string)
embedding = Lambda(ELMoEmbedding, output_shape=(1024, ))(input_text)
dense = Dense(256, activation='relu')(embedding)
pred = Dense(5, activation='softmax')(dense)
model = Model(inputs=[input_text], outputs=pred)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train inside a TF 1.x session; the hub module's variables and lookup tables must be initialised first.
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    history = model.fit(x_train, y_train, epochs=1, batch_size=16)
    model.save_weights('./elmo-model.h5')

# Reload the saved weights in a fresh session and predict on the held-out test set.
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    model.load_weights('./elmo-model.h5')
    predicts = model.predict(x_test, batch_size=16)

y_test = decode(le, y_test)
y_preds = decode(le, predicts)



# 4. Results

In [None]:
from sklearn import metrics

print(metrics.confusion_matrix(y_test, y_preds))

print(metrics.classification_report(y_test, y_preds))

from sklearn.metrics import accuracy_score

print("Accuracy of ELMO is:",accuracy_score(y_test,y_preds))
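For readers unfamiliar with the confusion matrix printed above, the following pure-Python sketch builds one by hand, using the same convention as scikit-learn: rows are true classes, columns are predicted classes. The function and the toy labels are assumptions for illustration only.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count (true class, predicted class) pairs; rows are true, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix

# Toy labels, made up for illustration only.
labels = ["business", "sport", "tech"]
y_true = ["sport", "sport", "tech", "tech", "business", "business"]
y_pred = ["sport", "tech", "tech", "tech", "business", "sport"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)

# Correct predictions sit on the diagonal, so accuracy is their share of the total.
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(y_true)
```

Off-diagonal cells show which categories are confused with each other, which the per-class scores in the classification report only summarise.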

> **Past work on this dataset achieved at most 95.22% accuracy. Keras with ELMo embeddings achieved 95.5%, but a BERT base model without any preprocessing still did better at 97.75%.**
>
> [BERT model](https://www.kaggle.com/sarthak221995/textclassification-97-77-accuracy-bert)

# 5. Future Improvements on this kernel:

* Explore preprocessing steps on data.
* Explore other models as baseline.
* Make this notebook more informative and illustrative.
* Explanation of the ELMo embeddings model.
* Spend more time on data exploration.
* And many more...

# 6. References

> This kernel is based on the work of http://hunterheidenreich.com/blog/elmo-word-vectors-in-keras/