<a href="https://colab.research.google.com/github/HassanJoumaa/KAGGLE_NER_DATASET/blob/main/Tensoflow_NER_DATASET.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NER DATASET**
This is a very clean dataset, intended for anyone who wants to try their hand at the NER (Named Entity Recognition) task in NLP.

## ***1. Problem***

We will use this dataset to train an NER model that identifies the tag of each word in a sentence.

## ***2. Data***

The data comes from the NER dataset on Kaggle:

https://www.kaggle.com/namanj27/ner-dataset

## ***3. Evaluation***

We will evaluate the model based on the accuracy metric.
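
As a rough illustration of what token-level accuracy measures, the sketch below compares predicted and true tag indices position by position. The arrays and the helper name `token_accuracy` are made up for this example and are not part of the notebook.

```python
import numpy as np

def token_accuracy(y_true, y_pred):
    """Fraction of positions where the predicted tag index equals the true one."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())

# Two toy sentences of four tokens each; the integers stand for tag indices.
y_true = [[0, 0, 3, 0], [1, 2, 0, 0]]
y_pred = [[0, 0, 3, 0], [1, 0, 0, 0]]

print(token_accuracy(y_true, y_pred))  # 7 of 8 positions match -> 0.875
```

Since every position counts equally, padded positions also contribute to the score when sequences are padded to a fixed length.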


## ***4. Features***
* The dataset has roughly 1M rows and 4 columns, `['Sentence #', 'Word', 'POS', 'Tag']`, and its rows are grouped by `Sentence #`.

**Columns**

Word:
This column contains English dictionary words from the sentence they are taken from.

POS:
Part-of-speech tag.

Tag:
Standard named entity recognition tags as follows:

* ORGANIZATION - Georgia-Pacific Corp., WHO
* PERSON - Eddy Bonte, President Obama
* LOCATION - Murray River, Mount Everest
* DATE - June, 2008-06-29
* TIME - two fifty a m, 1:30 p.m.
* MONEY - 175 million Canadian Dollars, GBP 10.40
* PERCENT - twenty pct, 18.75 %
* FACILITY - Washington Monument, Stonehenge
* GPE - South East Asia, Midlothian

### **Import the Libraries**

In [None]:
%matplotlib inline
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

### **Download the Data from Kaggle**

In [None]:
!pip install --upgrade --force-reinstall --no-deps kaggle

In [None]:
# Set the username and key from your own Kaggle API token (kaggle.json).
# Keep real credentials out of shared notebooks.
os.environ['KAGGLE_USERNAME'] = "<your-kaggle-username>"
os.environ['KAGGLE_KEY'] = "<your-kaggle-api-key>"

In [None]:
# Downloading the Dataset from Kaggle
!kaggle datasets download -d namanj27/ner-dataset

In [None]:
# Unzipping the archive
!unzip ner-dataset.zip

### **Get & Clean the Data** 

In [None]:
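# pandas infers the compression from the .zip extension.
df = pd.read_csv("/content/ner-dataset.zip", encoding='latin1')
df.drop("POS", axis=1, inplace=True)
# 'Sentence #' is only set on the first word of each sentence, so forward-fill it.
# (fillna(method="ffill") is deprecated in recent pandas versions.)
df.ffill(inplace=True)
df.head(50)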

> ***Group the records by Sentence #***

In [None]:
agg_fun = lambda s: [(w, t) for w, t in zip(s["Word"],s['Tag'])]      
grouped = df.groupby('Sentence #').apply(agg_fun)
grouped
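
To make the forward-fill and grouping steps concrete, here is a minimal sketch on a hand-made frame. The rows and the tag names (`PER`, `LOC`, `O`) are toy values chosen for the example, not rows from the Kaggle file.

```python
import numpy as np
import pandas as pd

# Toy frame: the sentence id appears only on the first word of each sentence.
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", np.nan, np.nan, "Sentence: 2", np.nan],
    "Word": ["Obama", "visited", "Paris", "Hello", "world"],
    "Tag": ["PER", "O", "LOC", "O", "O"],
})

# Forward-fill so that every row carries its sentence id.
toy = toy.ffill()

# Collect one list of (word, tag) pairs per sentence.
pairs = toy.groupby("Sentence #")[["Word", "Tag"]].apply(
    lambda s: list(zip(s["Word"], s["Tag"]))
)

print(pairs["Sentence: 1"])
```

Each entry of `pairs` is a list of tuples, which is the shape the next cells rely on when splitting words from tags.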

### **Get the Sentences and their respective Tags**

In [None]:
sentences = [[w[0] for w in s] for s in grouped]
tags = [[t[1] for t in s] for s in grouped]

In [None]:
print(sentences[0])
print(tags[0])
print(len(sentences[0]))
print(len(tags[0]))

In [None]:
lengths = [len(s) for s in sentences]
plt.hist(lengths, bins=50)
plt.show()

> ***The histogram of sentence lengths suggests that 50 is a reasonable maximum input length.***
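
One way to back up a visual choice like this is to check what share of sentences fit within a candidate length. The sketch below does so on made-up lengths; `coverage` is a hypothetical helper, and the numbers are not taken from the dataset.

```python
import numpy as np

def coverage(lengths, max_length):
    """Share of sentences whose length is at most max_length."""
    lengths = np.asarray(lengths)
    return float((lengths <= max_length).mean())

# Made-up sentence lengths, for illustration only.
toy_lengths = [5, 12, 18, 21, 22, 25, 30, 34, 47, 80]

print(coverage(toy_lengths, 50))  # 9 of 10 sentences fit -> 0.9
```

Running the same helper on the real `lengths` list would show how many sentences get truncated at 50.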

### **Tokenizing the Data**

In [None]:
embedding_dim = 32
max_length = 50
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"

#### ***Tokenizing the Sentences***

In [None]:
tokenizer = Tokenizer(oov_token=oov_tok)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
# +1 because index 0 is reserved for padding.
vocab_size = len(tokenizer.word_index)+1
# The Tokenizer lowercases by default, so look up each word in lowercase.
sequences = [[word_index[w.lower()] for w in s] for s in sentences]
X = pad_sequences(sequences=sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
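
The effect of `padding='post'` and `truncating='post'` can be sketched in plain NumPy. The function `pad_post` below is a simplified stand-in written for this example, not the Keras implementation.

```python
import numpy as np

def pad_post(seqs, maxlen, value=0):
    """Pad short sequences at the end and cut long ones at the end."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for i, seq in enumerate(seqs):
        trimmed = seq[:maxlen]           # 'post' truncation drops the tail
        out[i, :len(trimmed)] = trimmed  # 'post' padding leaves the fill value at the end
    return out

print(pad_post([[4, 7, 2], [9, 1, 5, 3, 8, 6]], maxlen=5))
# [[4 7 2 0 0]
#  [9 1 5 3 8]]
```

Every row ends up with the same length, which is what lets the sentences be stacked into a single array `X`.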

#### ***Tokenizing the Labels***

In [None]:
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(tags)
labels_word_index = label_tokenizer.word_index
labels = [[labels_word_index[l.lower()] for l in t] for t in tags]
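# Pad with the "o" tag, then shift the 1-based Tokenizer indices to 0-based class ids.
padded_labels = pad_sequences(sequences=labels, maxlen=max_length, padding=padding_type, truncating=trunc_type, value=labels_word_index["o"]) - 1
# One-hot encode every position: shape (num_sentences, max_length, num_classes).
y = to_categorical(padded_labels, num_classes=len(labels_word_index))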
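
The label pipeline for a single sentence can be traced by hand: look up 1-based indices, pad with the index of `"o"`, subtract 1, and one-hot encode. The label vocabulary in this sketch is a toy assumption with three tags, not the one learned from the dataset.

```python
import numpy as np

# Toy label vocabulary with 1-based indices, as a Keras Tokenizer would assign them.
labels_word_index = {"o": 1, "loc": 2, "per": 3}
sentence_tags = ["per", "o", "loc"]
max_length = 5

# 1-based indices, padded at the end with the index of "o".
ids = [labels_word_index[t] for t in sentence_tags]
ids = ids + [labels_word_index["o"]] * (max_length - len(ids))

# Shift to 0-based class ids so they can index a one-hot matrix.
zero_based = np.array(ids) - 1

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(labels_word_index), dtype="float32")[zero_based]

print(zero_based)     # [2 0 1 0 0]
print(one_hot.shape)  # (5, 3)
```

Padding with the `"o"` index before the shift means padded positions end up as class 0 rather than as an invalid class.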

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=42)

train_dataset = tf.data.Dataset.from_tensor_slices((tf.constant(X_train), tf.constant(y_train)))
train_dataset = train_dataset.batch(64)
val_dataset = tf.data.Dataset.from_tensor_slices((tf.constant(X_val), tf.constant(y_val)))
val_dataset = val_dataset.batch(64)

### **Creating & training the model**

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True, recurrent_dropout=0.1)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(len(labels_word_index), activation='softmax'))
])
model.summary()
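
The parameter counts reported by `model.summary()` can be checked by hand with the usual formulas for embedding, LSTM, and dense layers. In the sketch below, `vocab_size = 10_000` and `n_labels = 17` are placeholder assumptions; the real values depend on the fitted tokenizers.

```python
def embedding_params(vocab_size, dim):
    # One vector of size `dim` per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

vocab_size = 10_000  # assumption for illustration
n_labels = 17        # assumption for illustration

emb = embedding_params(vocab_size, 32)
bilstm = 2 * lstm_params(32, 64)         # forward and backward LSTM
dense = dense_params(2 * 64, n_labels)   # the two directions are concatenated

print(emb, bilstm, dense, emb + bilstm + dense)
```

With these placeholder sizes the embedding dominates the total, so the vocabulary size is the main driver of model size in this architecture.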

In [None]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
history = model.fit(train_dataset, epochs=3, validation_data=val_dataset)

### **Testing on our Sentences**

In [None]:
labels_index_word = {value: key for key, value in labels_word_index.items()}
labels_index_word

In [None]:
def predict_on_sentence():
  my_sentence = input("Enter your own sentence: ")
  my_sequence = tokenizer.texts_to_sequences([my_sentence])
  my_padded = pad_sequences(sequences=my_sequence, maxlen=max_length, padding=padding_type, truncating=trunc_type)
  prediction = model.predict(np.array(my_padded))
  # Most likely class per position; +1 maps back to the 1-based label index.
  p = np.argmax(prediction, axis=-1)+1
  # Keep only the positions that hold real tokens, not padding.
  n_tokens = min(len(my_sequence[0]), max_length)
  labeled_preds = [labels_index_word[label] for label in p[0][:n_tokens]]
  return labeled_preds

In [None]:
predict_on_sentence()
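
The decoding step can be followed on a small hand-written probability array. The label map, the words, and the probabilities below are toy values standing in for the model's output.

```python
import numpy as np

# Toy inverse label map with 1-based keys, mirroring labels_index_word.
labels_index_word = {1: "o", 2: "loc", 3: "per"}
words = ["obama", "visited", "paris"]

# Fake softmax output: 1 sentence, 5 positions (3 words + 2 padding), 3 classes.
probs = np.array([[
    [0.10, 0.10, 0.80],
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.90, 0.05, 0.05],
]])

# argmax gives 0-based class ids; +1 restores the 1-based label indices.
ids = np.argmax(probs, axis=-1)[0] + 1

# zip stops at the last real word, so padded positions are dropped.
tagged = [(w, labels_index_word[int(i)]) for w, i in zip(words, ids)]
print(tagged)  # [('obama', 'per'), ('visited', 'o'), ('paris', 'loc')]
```

Pairing each tag with its word makes the output easier to read than a bare list of tags.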