<a href="https://colab.research.google.com/github/AjeetSingh02/Notebooks/blob/master/NLP_TransferLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Applying ULMFiT for classification</h1>

In [0]:
# import libraries
import fastai
from fastai import *
from fastai.text import * 
import pandas as pd
import numpy as np
from functools import partial
import io
import os

In [0]:
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

In [0]:
df = pd.DataFrame({'label':dataset.target, 'text':dataset.data})

In [0]:
df.shape

(11314, 2)

In [0]:
df = df[df['label'].isin([1,10])]
df = df.reset_index(drop = True)

In [0]:
df.shape

(1184, 2)

In [0]:
df['label'].value_counts()

10    600
1     584
Name: label, dtype: int64

In [0]:
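# keep only ASCII letters; regex=True is required on pandas >= 2.0, where the default is a literal match
df['text'] = df['text'].str.replace("[^a-zA-Z]", " ", regex=True)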
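
The cleaning step above replaces every character that is not an ASCII letter with a single space, so digits and punctuation disappear while the string length stays the same. A minimal standalone sketch of the same pattern using the standard `re` module (the helper name `keep_letters` and the sample string are made up for illustration):

```python
import re


def keep_letters(text: str) -> str:
    """Replace every character that is not an ASCII letter with a space."""
    return re.sub(r"[^a-zA-Z]", " ", text)


sample = "GPU costs $300, right?"
cleaned = keep_letters(sample)

# Splitting on whitespace shows which words survive the cleaning.
print(cleaned.split())
```

Because each removed character becomes a space rather than being dropped, words never get fused together; the later whitespace split takes care of the extra spaces.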

In [0]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords 
stop_words = set(stopwords.words('english'))  # a set makes the membership test below O(1)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
# tokenization 
tokenized_doc = df['text'].apply(lambda x: x.split())

# remove stop-words 
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])

# de-tokenization: join each token list back into a single string
df['text'] = tokenized_doc.apply(' '.join)
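
One detail of the stop-word step is worth seeing on a small example: the NLTK English list is lowercase, and the text is never lowercased, so capitalised stop words such as "The" pass through the filter. The sketch below uses a tiny, hypothetical stop-word set in place of the NLTK list, and the `lowercase` flag is an addition for illustration, not something the cells above do:

```python
# Hypothetical, tiny stop-word set standing in for NLTK's English list.
STOP_WORDS = {"the", "is", "a", "of"}


def remove_stop_words(text, stop_words=STOP_WORDS, lowercase=False):
    """Split on whitespace, drop stop words, and join the rest back together."""
    tokens = text.split()
    if lowercase:
        # Optional: lowercase first so capitalised stop words are also removed.
        tokens = [token.lower() for token in tokens]
    return " ".join(token for token in tokens if token not in stop_words)


as_in_notebook = remove_stop_words("The price of a GPU is high")
lowercased = remove_stop_words("The price of a GPU is high", lowercase=True)

print(as_in_notebook)
print(lowercased)
```

Whether to lowercase is a judgment call: the pretrained AWD-LSTM tokenizer in fastai handles capitalisation on its own, so leaving case untouched is a defensible choice.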

In [0]:
from sklearn.model_selection import train_test_split

# split data into training and validation set
df_trn, df_val = train_test_split(df, stratify = df['label'], test_size = 0.4, random_state = 12)

In [0]:
df_trn.shape, df_val.shape

((710, 2), (474, 2))

In [0]:
# Language model data
data_lm = TextLMDataBunch.from_df(train_df = df_trn, valid_df = df_val, path = "")

# Classifier model data
data_clas = TextClasDataBunch.from_df(path = "", train_df = df_trn, valid_df = df_val, vocab=data_lm.train_ds.vocab, bs=32)

In [0]:
learn = language_model_learner(data_lm, arch=AWD_LSTM, drop_mult=0.7)

In [0]:
# fine-tune the language model for one epoch with a maximum learning rate of 1e-2
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,6.704717,5.508235,0.238329,07:06


In [0]:
learn.save_encoder('ft_enc')

In [0]:
learn = text_classifier_learner(data_clas, arch=AWD_LSTM, drop_mult=0.7)
learn.load_encoder('ft_enc')

In [0]:
learn.fit_one_cycle(1, 1e-2)


epoch,train_loss,valid_loss,accuracy,time
0,0.441268,0.210568,0.898734,19:04


In [0]:
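# get predictions on the validation set
preds, targets = learn.get_preds()

# predicted class = index of the highest probability; rows are predictions, columns are true labels
predictions = preds.argmax(dim=1).numpy()
pd.crosstab(predictions, targets.numpy(), rownames=['predicted'], colnames=['actual'])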
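
The cross-tabulation is a confusion matrix, and its diagonal holds the correctly classified examples. A minimal sketch of reading accuracy off such a table, using made-up prediction and target arrays rather than the model's real outputs:

```python
import numpy as np
import pandas as pd

# Made-up predictions and targets, for illustration only.
predictions = np.array([0, 1, 1, 0, 1, 0])
targets = np.array([0, 1, 0, 0, 1, 1])

# Rows are predicted classes, columns are true classes.
confusion = pd.crosstab(predictions, targets,
                        rownames=["predicted"], colnames=["actual"])

# The diagonal counts the matches. This assumes every class appears in both
# arrays, so the table is square with rows and columns in the same order.
matrix = confusion.to_numpy()
accuracy = np.trace(matrix) / matrix.sum()

print(confusion)
print(f"accuracy = {accuracy:.3f}")
```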

<h1>Flair implementation using Flair embeddings</h1>

In [0]:
from flair.data import Sentence
from flair.models import SequenceTagger

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


In [0]:
text = '''Original Tax Invoice
OOoLA IGAAJCA1389G1ZG
999799
ANI Technologies Pvt. Ltd.
ANI Technologies Pvt. Ltd., Infinity Think Tank, Business Auxiliary
Tower-1, 2nd floor, Plot-A3, Block-GP,Sector-5 Service
Salt Lake,Kolkata : 700091
CIUPELNHR56992 19/05/2018
Ronalisha +919886153386
ANI Technologies Pvt. Ltd., Infinity Think
Tank, Tower-1, 2nd floor, Plot-A3, Block-GP,Sector-5, Salt
Lake,Kolkata : 700091
Description Amount (INR)
Ola Convenience Fee - CRN1851147515
Convenience Fee (Ride) 418.86
CGST 7
x4
SGST
Te
eee
Total
C c 22.26
pepe ceccan acne neneaneeeeneeeeeee ee
Authorised Signatory
Q2x
—_—_——_
: -
i .
‘ 7
i
o
i
. 1
----------------------
'''
sentence = Sentence(text)

# load the NER tagger
tagger = SequenceTagger.load('ner')

# run NER over sentence
tagger.predict(sentence)


#print(sentence)
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

2019-03-25 07:23:49,681 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/models-v0.4/NER-conll03-english/en-ner-conll03-v0.4.pt not found in cache, downloading to /tmp/tmpfzx6xjgy


100%|██████████| 432197603/432197603 [00:24<00:00, 17614673.71B/s]

2019-03-25 07:24:14,779 copying /tmp/tmpfzx6xjgy to cache at /root/.flair/models/en-ner-conll03-v0.4.pt





2019-03-25 07:24:16,305 removing temp file /tmp/tmpfzx6xjgy
2019-03-25 07:24:16,307 loading file /root/.flair/models/en-ner-conll03-v0.4.pt
The following NER tags are found:
LOC-span [36]: "Salt
Lake,Kolkata"


In [0]:
text = '''SCALED AGILE” Invoice
scated Agile, In ’ :
400 Airport Blvd, #300 | Number | Date |
Boulder, CO 80301 — = }
136085 | 2/28/18 |
Bill To [stip uD _ = a |
eee) ee
Satyanarayana Murthy Kotta Satyanarayana Murthy Kotta
Infosys Ltd, Manikonda | Infosys Ltd, Manikonda
HYDERABAD, Telangana, XXXXXX | | HYDERABAD, Telangana, XXXXXX |
India | India
eee
PO Number Terms Due Date |
#00040341 | Due on receipt 2/28/18 |
|
Payments/Adjustments
laty | Description Price (USD) | Discount (USD) | Totals |
pee
| 5.00 | Leading SAFe 4 (For Satyanarayana Murthy Kotta) $100.00 $500.00 |
| Sub-Total $500.00 |
Payment Due | $500.00 |
ee Farren Applied} ($500.00)|
Balance Due} _$0.00|
|
'
Si.
s an
KL
ae
ang
----------------------
'''
sentence = Sentence(text)

# load the NER tagger
tagger = SequenceTagger.load('ner')

# run NER over sentence
tagger.predict(sentence)


#print(sentence)
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

2019-03-25 07:26:13,771 loading file /root/.flair/models/en-ner-conll03-v0.4.pt
The following NER tags are found:
ORG-span [8,9]: "Airport Blvd,"
LOC-span [16]: "CO"
ORG-span [31,32,33,34,35,36,37,38]: "ee
Satyanarayana Murthy Kotta Satyanarayana Murthy Kotta
Infosys Ltd, Manikonda"
ORG-span [40]: "Infosys"
ORG-span [42,43,44]: "Manikonda
HYDERABAD, Telangana, XXXXXX"
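
Spans printed this way can run across line breaks, which makes the raw output awkward to compare between runs. A minimal sketch that parses the printed format shown above into `(label, token_ids, text)` tuples; it assumes the exact `LABEL-span [ids]: "text"` layout produced by this Flair version, and `parse_spans` is a hypothetical helper, not part of Flair:

```python
import re

# One printed span: LABEL-span [1,2,3]: "text that may contain newlines"
# The lookahead requires the closing quote to be followed by the next span or the end.
SPAN_RE = re.compile(
    r'([A-Z]+)-span \[([\d,]+)\]: "(.*?)"(?=\n[A-Z]+-span \[|\s*\Z)',
    re.DOTALL,
)


def parse_spans(output):
    """Turn printed span lines into (label, token_ids, text) tuples."""
    spans = []
    for label, ids, text in SPAN_RE.findall(output):
        token_ids = [int(i) for i in ids.split(",")]
        # Collapse line breaks and repeated spaces inside the span text.
        spans.append((label, token_ids, " ".join(text.split())))
    return spans


printed = '''ORG-span [8,9]: "Airport Blvd,"
LOC-span [16]: "CO"
ORG-span [42,43,44]: "Manikonda
HYDERABAD, Telangana, XXXXXX"'''

parsed = parse_spans(printed)
for span in parsed:
    print(span)
```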


In [0]:
text = '''TAK INVOICE
hie MT Ir I :
PES INE INNS LIB LA
\ Unit of Choudhary Ventures)
©CO-303P& 364 Sector-8,Panchkula(HR)-134109
PH: 0172-4676555
ROYAL RE TREAT
GSTIN: O6ADDFSO0083P2Z5
Bill No 4 lime: 22:13:28 Date:04/06/2018
é No..07 Pax: 1 vtev L
KOTS:45
Particulars Qty Rate GST% Amount
DINNER VEG. BUFFET 5 476.19 5.0 2380.95
Tctal 5 2380.95
CGST 59.52
SGST 59,52
ROYALRE Grand Total 2500.00
Rupees Two Thousand Five Hundred only
THANKS & PLZ VISIT AGAIN
----------------------
'''
sentence = Sentence(text)

# load the NER tagger
tagger = SequenceTagger.load('ner')

# run NER over sentence
tagger.predict(sentence)


#print(sentence)
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

2019-03-25 07:27:26,414 loading file /root/.flair/models/en-ner-conll03-v0.4.pt
The following NER tags are found:
ORG-span [1]: "TAK"
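
The three cells above repeat the same steps and reload the tagger each time. A small helper can take an already loaded tagger and reuse it for every document. The sketch below is an assumption about how one might structure this: `find_entities`, `FakeSentence`, and `FakeTagger` are hypothetical names, and the two fake classes only stand in for Flair's `Sentence` and `SequenceTagger` so the example runs without downloading the model.

```python
def find_entities(text, tagger, make_sentence):
    """Wrap text in a sentence object, run the tagger once, return its NER spans."""
    sentence = make_sentence(text)
    tagger.predict(sentence)
    return sentence.get_spans("ner")


class FakeSentence:
    """Stand-in for flair.data.Sentence (illustration only)."""

    def __init__(self, text):
        self.text = text
        self.spans = []

    def get_spans(self, tag_type):
        return self.spans


class FakeTagger:
    """Stand-in for a loaded SequenceTagger; 'tags' one hard-coded word."""

    def predict(self, sentence):
        if "Chennai" in sentence.text:
            sentence.spans.append(("LOC", "Chennai"))


tagger = FakeTagger()  # with Flair: tagger = SequenceTagger.load('ner'), loaded once
for text in ["Chennai", "TAK INVOICE"]:
    print(text, "->", find_entities(text, tagger, FakeSentence))
```

With Flair installed, the call would be `find_entities(text, tagger, Sentence)`, which avoids loading the 400 MB model file again for each invoice.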


<h1>Flair implementation using BERT embeddings</h1>

In [0]:
from flair.embeddings import BertEmbeddings
# init embedding
embedding = BertEmbeddings('bert-large-cased')

100%|██████████| 213450/213450 [00:00<00:00, 2389883.32B/s]
100%|██████████| 1242874899/1242874899 [00:20<00:00, 59225166.87B/s]


In [0]:
#text
text = '''Original Tax Invoice
OOoLA IGAAJCA1389G1ZG
999799
ANI Technologies Pvt. Ltd.
ANI Technologies Pvt. Ltd., Infinity Think Tank, Business Auxiliary
Tower-1, 2nd floor, Plot-A3, Block-GP,Sector-5 Service
Salt Lake,Kolkata : 700091
CIUPELNHR56992 19/05/2018
Ronalisha +919886153386
ANI Technologies Pvt. Ltd., Infinity Think
Tank, Tower-1, 2nd floor, Plot-A3, Block-GP,Sector-5, Salt
Lake,Kolkata : 700091
Description Amount (INR)
Ola Convenience Fee - CRN1851147515
Convenience Fee (Ride) 418.86
CGST 7
x4
SGST
Te
eee
Total
C c 22.26
pepe ceccan acne neneaneeeeneeeeeee ee
Authorised Signatory
Q2x
—_—_——_
: -
i .
‘ 7
i
o
i
. 1
----------------------

'''

# create a sentence
sentence = Sentence(text)

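# embed words in sentence
# note: the pretrained 'ner' tagger computes its own embeddings inside predict(),
# so this BERT embedding step does not change the tags printed below
embedding.embed(sentence)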


tagger = SequenceTagger.load('ner')

# run NER over sentence
tagger.predict(sentence)


#print(sentence)
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)


2019-03-25 07:30:25,919 loading file /root/.flair/models/en-ner-conll03-v0.4.pt
The following NER tags are found:
LOC-span [36]: "Salt
Lake,Kolkata"


In [0]:
#text
text = '''SCALED AGILE” Invoice
scated Agile, In ’ :
400 Airport Blvd, #300 | Number | Date |
Boulder, CO 80301 — = }
136085 | 2/28/18 |
Bill To [stip uD _ = a |
eee) ee
Satyanarayana Murthy Kotta Satyanarayana Murthy Kotta
Infosys Ltd, Manikonda | Infosys Ltd, Manikonda
HYDERABAD, Telangana, XXXXXX | | HYDERABAD, Telangana, XXXXXX |
India | India
eee
PO Number Terms Due Date |
#00040341 | Due on receipt 2/28/18 |
|
Payments/Adjustments
laty | Description Price (USD) | Discount (USD) | Totals |
pee
| 5.00 | Leading SAFe 4 (For Satyanarayana Murthy Kotta) $100.00 $500.00 |
| Sub-Total $500.00 |
Payment Due | $500.00 |
ee Farren Applied} ($500.00)|
Balance Due} _$0.00|
|
'
Si.
s an
KL
ae
ang
----------------------
'''

# create a sentence
sentence = Sentence(text)

# embed words in sentence
embedding.embed(sentence)


tagger = SequenceTagger.load('ner')

# run NER over sentence
tagger.predict(sentence)


#print(sentence)
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)


2019-03-25 07:31:35,936 loading file /root/.flair/models/en-ner-conll03-v0.4.pt
The following NER tags are found:
ORG-span [8,9]: "Airport Blvd,"
LOC-span [16]: "CO"
ORG-span [31,32,33,34,35,36,37,38]: "ee
Satyanarayana Murthy Kotta Satyanarayana Murthy Kotta
Infosys Ltd, Manikonda"
ORG-span [40]: "Infosys"
ORG-span [42,43,44]: "Manikonda
HYDERABAD, Telangana, XXXXXX"


In [0]:
#text
text = '''TAK INVOICE
hie MT Ir I :
PES INE INNS LIB LA
\ Unit of Choudhary Ventures)
©CO-303P& 364 Sector-8,Panchkula(HR)-134109
PH: 0172-4676555
ROYAL RE TREAT
GSTIN: O6ADDFSO0083P2Z5
Bill No 4 lime: 22:13:28 Date:04/06/2018
é No..07 Pax: 1 vtev L
KOTS:45
Particulars Qty Rate GST% Amount
DINNER VEG. BUFFET 5 476.19 5.0 2380.95
Tctal 5 2380.95
CGST 59.52
SGST 59,52
ROYALRE Grand Total 2500.00
Rupees Two Thousand Five Hundred only
THANKS & PLZ VISIT AGAIN
----------------------
'''

# create a sentence
sentence = Sentence(text)

# embed words in sentence
embedding.embed(sentence)


tagger = SequenceTagger.load('ner')

# run NER over sentence
tagger.predict(sentence)


#print(sentence)
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)


2019-03-25 07:32:39,940 loading file /root/.flair/models/en-ner-conll03-v0.4.pt
The following NER tags are found:
ORG-span [1]: "TAK"


In [0]:
tagger = SequenceTagger.load('ner')

# run NER over sentence
#text = "a Guest Information: SAXENA/SALESH/MR Dates of Stay: 19/01/2016 - 11/03/2016"
text = "Chennai"

sentence = Sentence(text)

# embed words in sentence
embedding.embed(sentence)

tagger.predict(sentence)

#print(sentence)
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

2019-03-25 07:33:17,576 loading file /root/.flair/models/en-ner-conll03-v0.4.pt
The following NER tags are found:
LOC-span [1]: "Chennai"


<h1>Combining BERT and Flair embeddings</h1>

In [0]:
from flair.embeddings import FlairEmbeddings, BertEmbeddings

# init Flair embeddings
flair_forward_embedding = FlairEmbeddings('multi-forward')
flair_backward_embedding = FlairEmbeddings('multi-backward')

# init multilingual BERT
bert_embedding = BertEmbeddings('bert-base-multilingual-cased')

# Now instantiate the StackedEmbeddings class and pass it a list containing these three embeddings.

from flair.embeddings import StackedEmbeddings

# create the StackedEmbedding object that combines all embeddings
stacked_embeddings = StackedEmbeddings(
    embeddings=[flair_forward_embedding, flair_backward_embedding, bert_embedding])

text = '''SCALED AGILE” Invoice
scated Agile, In ’ :
400 Airport Blvd, #300 | Number | Date |
Boulder, CO 80301 — = }
136085 | 2/28/18 |
Bill To [stip uD _ = a |
eee) ee
Satyanarayana Murthy Kotta Satyanarayana Murthy Kotta
Infosys Ltd, Manikonda | Infosys Ltd, Manikonda
HYDERABAD, Telangana, XXXXXX | | HYDERABAD, Telangana, XXXXXX |
India | India
eee
PO Number Terms Due Date |
#00040341 | Due on receipt 2/28/18 |
|
Payments/Adjustments
laty | Description Price (USD) | Discount (USD) | Totals |
pee
| 5.00 | Leading SAFe 4 (For Satyanarayana Murthy Kotta) $100.00 $500.00 |
| Sub-Total $500.00 |
Payment Due | $500.00 |
ee Farren Applied} ($500.00)|
Balance Due} _$0.00|
|
'
Si.
s an
KL
ae
ang
----------------------'''



sentence = Sentence(text)

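# embed the sentence with the StackedEmbeddings object, as you would with any single embedding
# note: the pretrained 'ner' tagger below still uses its own embeddings, so the predicted tags are unchanged
stacked_embeddings.embed(sentence)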


tagger = SequenceTagger.load('ner')

# run NER over sentence
tagger.predict(sentence)


#print(sentence)
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)


2019-03-25 07:46:28,352 loading file /root/.flair/models/en-ner-conll03-v0.4.pt
The following NER tags are found:
ORG-span [8,9]: "Airport Blvd,"
LOC-span [16]: "CO"
ORG-span [31,32,33,34,35,36,37,38]: "ee
Satyanarayana Murthy Kotta Satyanarayana Murthy Kotta
Infosys Ltd, Manikonda"
ORG-span [40]: "Infosys"
ORG-span [42,43,44]: "Manikonda
HYDERABAD, Telangana, XXXXXX"


<h1>Using spaCy on a sample resume</h1>

In [0]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [0]:
text = '''aaryan vatts
Mumbai, Maharashtra - Email me on Indeed: indeed.com/r/aaryan-vatts/536d7f3aac570f70

To enhance my knowledge and capabilities by working in a dynamic organization and the leading
brand that prides itself by giving substantial responsibility to all the talents

Willing to relocate: Anywhere

WORK EXPERIENCE

sales representative

Aditya Birla Minacs Pvt. Ltd -  Mumbai, Maharashtra -

May 2014 to Present

One year

escalation manager

Barclaycard -  Mumbai, Maharashtra

I have worked with Aditya Birla minacs for facebook process for a year as a sales representative.
And now from last 2 and a half years I'm working with uk based credit card company n bank
called Barclaycard as a escalation manager

EDUCATION

west Bengal council in science and commerce

west Bengal council of higher education -  Kolkata, West Bengal

2011 to 2013

SKILLS

good communication in English, Hindi, Sanskrit and Maithili, COMPUTER KNOWLEDGE - ms dos,
word and excel and besic (3 years)

LINKS

http://vattsaaryan2011@gmail.com

AWARDS

employee of the month and year

April 2017

https://www.indeed.com/r/aaryan-vatts/536d7f3aac570f70?isid=rex-download&ikw=download-top&co=IN
http://vattsaaryan2011@gmail.com


I've been promoted from a front line agent to an escalation manager and now as team coach
(none paper)
Have also received applications from Clint and customers several time on social media. Been
called to UK to be rewarded for my work.

ADDITIONAL INFORMATION

Personal information- 
Dob - […]
Marital status - single
Nationality - Indian
Religion- Hindu

None Academic Activity -
Have also worked with some production company and done some none academic activity like tv
serials like "cid and gumrah" and concerts and have done music and acting course too'''

In [0]:
doc = nlp(text)
for entity in doc.ents:
    print(entity, entity.label_)


 GPE
Mumbai GPE
Maharashtra - Email ORG

 GPE


Willing PERSON


Aditya Birla PERSON
Minacs Pvt PERSON
Ltd -   PERSON
Mumbai GPE
Maharashtra -

 PRODUCT
May 2014 to DATE
One year DATE


Barclaycard -   PERSON
Mumbai GPE
Maharashtra

 PRODUCT
Aditya Birla PERSON
a year DATE

 GPE
last 2 and a half years DATE

 GPE
Barclaycard ORG


 DATE
Bengal GPE
Bengal GPE
Kolkata PERSON
West Bengal GPE
2011 DATE
English LANGUAGE
Hindi ORG
Sanskrit GPE
Maithili GPE

 GPE
3 years DATE
the month DATE
year

April 2017 DATE

 GPE

 GPE

 GPE
Clint ORG

 GPE
UK GPE

Dob - […]
Marital EVENT

 GPE
Nationality - Indian ORG

Religion- EVENT

 GPE
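
Many of the entities above are nothing but line breaks labelled `GPE`, and others carry trailing newlines. A minimal post-processing sketch that drops whitespace-only entities and normalises the rest; it works on plain `(text, label)` pairs, which with spaCy could be built as `[(ent.text, ent.label_) for ent in doc.ents]`. The helper name `clean_entities` is hypothetical, and the sample pairs are a hand-copied subset of the output above:

```python
def clean_entities(entities):
    """Drop whitespace-only entities and collapse whitespace inside the rest."""
    cleaned = []
    for text, label in entities:
        normalized = " ".join(text.split())
        if normalized:  # skip entities that were only spaces or newlines
            cleaned.append((normalized, label))
    return cleaned


# A few pairs in the shape of the output above (illustration only).
raw = [
    ("\n", "GPE"),
    ("Mumbai", "GPE"),
    ("Maharashtra -\n\n", "PRODUCT"),
    ("May 2014 to", "DATE"),
]

filtered = clean_entities(raw)
print(filtered)
```

This only tidies the output; it does not correct mislabelled entities such as `Kolkata PERSON`, which come from the model itself.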
