
# Step 0. Problem Formulation and approach for my solution
I have to solve a text retrieval problem: finding the store number in a transaction description. The problem comes with a small dataset of 300 rows in total, with three columns: transaction_descriptor, store_number, and dataset. I have to select and train a model that can find the store number given a text as input.

I am going to use spaCy's Named Entity Recognition (NER) model, trained on this dataset starting from a blank English pipeline. According to the spaCy documentation, this system features a sophisticated word embedding strategy using subword features and "Bloom" embeddings, a deep convolutional neural network with residual connections, and a novel transition-based approach to named entity parsing. It is designed to give a good balance of efficiency, accuracy and adaptability, and it is easy to train on a custom dataset. For more information, please check this [link](https://spacy.io/api/entityrecognizer).

# Step 01. Import necessary Dependencies
Using pandas and spaCy for data processing and modelling, plus a few utility libraries.

In [1]:
import pandas as pd
import spacy
from spacy import displacy
from wordcloud import WordCloud, STOPWORDS
from spacy.util import minibatch, compounding
import numpy as np  
import matplotlib.pyplot as plt
import re
import random
import math
import copy
from tqdm.notebook import tqdm
random.seed(42)

# Step 02. Downloading dataset from Google Drive and preprocessing
I have uploaded the dataset to Google Drive, so we download it from there. Then we inspect the data and check for null, empty, or duplicated values.


In [2]:
# Downloading dataset from google drive
!gdown 1CvzAQXHMTkRUWmXnTovGKfdPetCLTxY4

df = pd.read_csv('Summer Internship - Homework Exercise.csv')
df.sample(10)

Downloading...
From: https://drive.google.com/uc?id=1CvzAQXHMTkRUWmXnTovGKfdPetCLTxY4
To: /content/Summer Internship - Homework Exercise.csv
  0% 0.00/10.7k [00:00<?, ?B/s]100% 10.7k/10.7k [00:00<00:00, 15.5MB/s]


Unnamed: 0,transaction_descriptor,store_number,dataset
180,JIMMY JOHNS - 2,2,validation
89,NST BEST BUY #500 970304,500,train
44,CIRCLE K # 06524 OCALA FL,6524,train
277,NST BEST BUY #405 562536,405,test
168,SUBWAY 04016168,4016168,validation
3,BUFFALO WILD WINGS 003,3,train
244,WALGREENS #11332,11332,test
88,CASEYS GEN STORE 2,2,train
50,NST ROSS STORES #11572299,11572299,train
223,NNT POLO/RL WRENTHA130571,13057,test


In [3]:
df.count()

transaction_descriptor    300
store_number              300
dataset                   300
dtype: int64

In [4]:
df.isnull().sum()

transaction_descriptor    0
store_number              0
dataset                   0
dtype: int64

In [5]:
df[df.duplicated(keep=False)]

Unnamed: 0,transaction_descriptor,store_number,dataset


In [6]:
df.nunique()

transaction_descriptor    300
store_number              293
dataset                     3
dtype: int64

# Step 03. Splitting data for training and testing purpose
Rows marked train and validation are combined into a training set of 200 rows; the 100 rows marked test are held out for testing.

In [7]:
train_df = df.loc[df['dataset'].isin(['train', 'validation'])]
test_df = df.loc[df['dataset'] == 'test']

print(train_df.shape)
print(test_df.shape)

(200, 3)
(100, 3)


# Step 04. Loading spaCy for custom training model
I am using spaCy, a free, open-source library for Natural Language Processing in Python, and its Named Entity Recognition (NER) pipeline component to predict the *store number* from the text.

In [8]:
# Load en_core_web_sm, an English pipeline that includes vocabulary, syntax and entities.
nlp0 = spacy.load('en_core_web_sm')

# Show the components of the loaded pipeline
nlp0.pipe_names

['tagger', 'parser', 'ner']

In [9]:
#getting NER pipeline
ner0 = nlp0.get_pipe('ner')

# Step 05. Data formatting for spaCy
Here we format our data as required by spaCy. You can find more information here: https://spacy.io/api/data-formats

In simple words, the training data is a list of tuples. Each tuple pairs a text with a dictionary of entity annotations, so every training item carries four pieces of information:

*   Sentence/keyword (the raw text)
*   Start character offset of the span for which the custom entity is defined
*   End character offset of the same span (exclusive)
*   The custom label
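
As a concrete illustration of this format, the sketch below builds one training item by hand for a descriptor taken from the sample rows above. The offsets are character positions with an exclusive end, so slicing the text with them returns the store number. The variable names are only illustrative.

```python
# One training item in the (text, {"entities": [(start, end, label)]}) layout.
text = "NST BEST BUY #500 970304"
start, end = 14, 17  # character offsets of "500"; end is exclusive
train_item = (text, {"entities": [(start, end, "number")]})

# Slicing the text with the offsets recovers the labelled span.
print(text[start:end])
```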

In [10]:
# Takes a text and its label and formats them as a spaCy training tuple
def format_data(text, label):
    label = str(label)  # store numbers may be parsed as integers by pandas
    start = text.find(label)
    entities = []
    if start != -1:  # skip the annotation when the label is not in the text
        end = start + len(label)
        entities.append((start, end, 'number'))
    train_item = (text, {'entities': entities})
    return train_item
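
Because `str.find` returns the first character-level match, a span can start in the middle of a longer chunk of text, for example when the descriptor pads the store number with zeros. spaCy generally expects entity offsets to line up with token boundaries, so such spans may not be learned as intended. The helper below, `is_whitespace_aligned`, is a hypothetical check that approximates tokens by whitespace splitting. That is only a rough stand-in and is stricter than spaCy's tokenizer, which also splits on some punctuation.

```python
def is_whitespace_aligned(text, start, end):
    """Return True if text[start:end] begins and ends on whitespace boundaries."""
    starts_on_boundary = start == 0 or text[start - 1].isspace()
    ends_on_boundary = end == len(text) or text[end].isspace()
    return starts_on_boundary and ends_on_boundary

# Descriptors and labels taken from the sample rows shown earlier.
samples = [
    ("CASEYS GEN STORE 2", "2"),
    ("CIRCLE K # 06524 OCALA FL", "6524"),   # label sits inside "06524"
    ("NNT POLO/RL WRENTHA130571", "13057"),  # label sits inside "WRENTHA130571"
]

for text, label in samples:
    start = text.find(label)
    end = start + len(label)
    print(text, "->", is_whitespace_aligned(text, start, end))
```

Rows flagged as unaligned are worth inspecting by hand before training, since their annotations may not correspond to any whole token.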

# Step 06. Using format_data function to format the training and test sets

In [11]:
TRAIN_DATA = []
TEST_DATA = []

# Formatting training data
for _, item in train_df.iterrows():
    train_item = format_data(item['transaction_descriptor'],item['store_number'])
    TRAIN_DATA.append(train_item)
print(len(TRAIN_DATA))

# Formatting testing data
for _, item in test_df.iterrows():
    test_item = format_data(item['transaction_descriptor'],item['store_number'])
    TEST_DATA.append(test_item)
print(len(TEST_DATA))


200
100


# Step 07. Training loop for NER
Here, we first create a blank NLP model object and add an NER component to it.
Then we add our custom label '*number*'. Next we split the data into batches and update the model using dropout. We also keep a copy of the model from the iteration with the lowest training loss.

In [12]:
def train_ner(training_data, n_iter=50):
    """Steps
    Create a blank NLP model object
    Create and add NER to the NLP model
    Add labels from the training data
    Train, keeping the model with the lowest training loss
    """
    # Note: this uses the spaCy v2 training API (create_pipe, update(texts, annotations)).
    best = math.inf
    best_nlp = None
    TRAIN_DATA = list(training_data)  # copy so shuffling does not reorder the caller's list
    nlp = spacy.blank("en")  # create blank Language class

    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # a blank pipeline has no trained weights, so initialise them and get the optimizer
    optimizer = nlp.begin_training()
    for itn in tqdm(range(n_iter)):
        random.shuffle(TRAIN_DATA)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA)
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,  # batch of texts
                annotations,  # batch of annotations
                drop=0.5,  # dropout - make it harder to memorise data
                losses=losses,
                sgd=optimizer
            )
        # keep a snapshot of the model whenever the training loss improves
        if losses['ner'] < best:
            best = losses['ner']
            best_nlp = copy.deepcopy(nlp)
            print('Best Loss:', losses)
    return best_nlp

# Step 08. Train model on custom data

In [13]:
nlp2 = train_ner(TRAIN_DATA,100)

  0%|          | 0/100 [00:00<?, ?it/s]

Best Loss: {'ner': 330.48873929679394}
Best Loss: {'ner': 121.7360690254718}
Best Loss: {'ner': 48.28553517818477}
Best Loss: {'ner': 21.574708651514836}
Best Loss: {'ner': 9.311595288918147}
Best Loss: {'ner': 7.503397853335169}
Best Loss: {'ner': 4.770427097386712}
Best Loss: {'ner': 3.6880026426715258}
Best Loss: {'ner': 2.1217363503121414}
Best Loss: {'ner': 1.757901475643759}
Best Loss: {'ner': 1.6125089678903006}
Best Loss: {'ner': 0.000988578242443247}
Best Loss: {'ner': 4.151221589625196e-06}
Best Loss: {'ner': 1.824475880238575e-06}
Best Loss: {'ner': 1.606220695999317e-08}
Best Loss: {'ner': 3.407900953851571e-09}
Best Loss: {'ner': 5.187294779389708e-10}
Best Loss: {'ner': 3.870497016960632e-11}
Best Loss: {'ner': 2.985332627365027e-11}
Best Loss: {'ner': 2.440289394757227e-12}


# Step 09. Checking Accuracy on Test Dataset


In [14]:
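# Count exact matches between the first predicted entity and the true store number
count = 0
ok = []
for index, row in test_df.iterrows():
    docx2 = nlp2(row['transaction_descriptor'])
    # skip rows where the model found no entity
    if not docx2.ents:
        continue
    # strip leading zeros so that e.g. '0813' matches '813'
    if str(docx2.ents[0]).lstrip('0') == str(row['store_number']):
        count += 1
        ok.append(index)
print('Accuracy on test dataset: ', count/len(test_df))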

Accuracy on test dataset:  0.86
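
The comparison above is an exact-match accuracy computed after stripping leading zeros from the prediction. The sketch below isolates that metric in a small function so it can be checked without a trained model. `exact_match_accuracy` and the sample values are illustrative, with `None` standing for rows where no entity was found.

```python
def exact_match_accuracy(predictions, targets):
    """Share of rows whose prediction equals the target after stripping leading zeros."""
    correct = 0
    for pred, target in zip(predictions, targets):
        if pred is None:  # the model found no entity for this row
            continue
        if str(pred).lstrip("0") == str(target):
            correct += 1
    return correct / len(targets)

# Illustrative values only, not model output.
preds = ["0813", "1475", None, "3"]
golds = ["813", "4249", "170928", "3"]
print(exact_match_accuracy(preds, golds))
```

Rows without a prediction still count in the denominator, which matches dividing by `len(test_df)` in the cell above.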


# Step 10. Issues and Visualizing Wrong Predictions
Here we list the test rows where the prediction was wrong or where the model did not find any custom entity.


In [15]:
# List the test rows that were not predicted correctly

count = 0
print('{0: <50}'.format('Text'),'{0: <15}'.format('store_number'),'{0: <15}'.format('Prediction') )
print('='*100)
for index, row in test_df.iterrows():
    if index not in ok:
        docx2 = nlp2(row['transaction_descriptor'])
        # 'NONE' marks rows where the model could not find any entity
        result = str(docx2.ents[0]).lstrip('0') if docx2.ents else 'NONE'
        print('{0: <50}'.format(str(row['transaction_descriptor']) ),'{0: <15}'.format(str(row['store_number'])),'{0: <15}'.format(result) )

        count += 1
print('\nTotal wrong predictions in testing dataset: ',count)
 

Text                                               store_number    Prediction     
BP#9442088LIBERTYVILLE B                           9442088         NONE           
LBOUTLETS#4249 1475 N BUR                          4249            1475           
BP#8644346ES #30 B96                               8644346         30             
NNT POLO/RL WRENTHA130571                          13057           NONE           
NNT SEARS HOMETOWN 862751                          8627            862751         
NAVY EXCHANGE 050161 0003                          50161           3              
NNT FAMOUS FOOTWEAR001261                          1261            FOOTWEAR001261 
EXPRESS#0813                                       813             NONE           
NNT FAMOUS FOOTWEAR730376                          730376          NONE           
MARATHON PETRO170928   MIAMI                       170928          NONE           
NNT FAMOUSFOOTWEAR#132427                          132427          FAMOUSFOOTWEAR#13242

# Result Analysis and Comments
We can see that the model did not do a good job of predicting a store number that is embedded inside a longer token.
One likely reason is that tokenization works at the word level: when the raw text is tokenized, the model can only label whole tokens, not substrings of a token. We can also see that it predicted the wrong number in some cases, for example by picking a different number that appears in the same text. I assume that the small training set of only 200 rows hampered the model's ability to generalize well.
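
One way to test the tokenization hypothesis would be to separate digit runs from adjacent letters or symbols before the text reaches the model. The sketch below is a possible preprocessing step under that assumption, not something this notebook ran. `split_digit_runs` is a hypothetical helper, and entity offsets would need to be recomputed on the rewritten text. It would not help when the store number is only part of a longer digit run, as in `WRENTHA130571`.

```python
import re

# Zero-width boundary between a digit and a neighbouring non-digit, non-space character.
_DIGIT_BOUNDARY = re.compile(r"(?<=[^\d\s])(?=\d)|(?<=\d)(?=[^\d\s])")

def split_digit_runs(text):
    """Insert a space wherever a digit run touches letters or symbols."""
    return _DIGIT_BOUNDARY.sub(" ", text)

# Descriptors taken from the wrong predictions listed above.
for descriptor in ["NNT FAMOUS FOOTWEAR001261", "BP#9442088LIBERTYVILLE B", "EXPRESS#0813"]:
    print(split_digit_runs(descriptor))
```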