### Before we get to entity recognition, let us look at an approach to extractive text summarization

In [1]:
import pandas as pd
import nltk
import matplotlib.pyplot as plt

In [2]:
from nltk.corpus import stopwords

pd.set_option('display.max_colwidth',1000)
pd.set_option('display.max_columns', None)

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
story=["long ago , the mice had a general council to consider what measures they could take to outwit their common enemy , the cat . some said this , and some said that but at last a young mouse got up and said he had a proposal to make , which he thought would meet the case . you will all agree , said he , that our chief danger consists in the sly and treacherous manner in which the enemy approaches us . now , if we could receive some signal of her approach , we could easily escape from her . i venture , therefore , to propose that a small bell be procured , and attached by a ribbon round the neck of the cat . by this means we should always know when she was about , and could easily retire while she was in the neighbourhood . this proposal met with general applause , until an old mouse got up and said that is all very well , but who is to bell the cat ? the mice looked at one another and nobody spoke . then the old mouse said it is easy to propose impossible remedies"]


In [5]:
story

['long ago , the mice had a general council to consider what measures they could take to outwit their common enemy , the cat . some said this , and some said that but at last a young mouse got up and said he had a proposal to make , which he thought would meet the case . you will all agree , said he , that our chief danger consists in the sly and treacherous manner in which the enemy approaches us . now , if we could receive some signal of her approach , we could easily escape from her . i venture , therefore , to propose that a small bell be procured , and attached by a ribbon round the neck of the cat . by this means we should always know when she was about , and could easily retire while she was in the neighbourhood . this proposal met with general applause , until an old mouse got up and said that is all very well , but who is to bell the cat ? the mice looked at one another and nobody spoke . then the old mouse said it is easy to propose impossible remedies']

### Using TFIDF Vectorizer to fetch important words from the story

In [6]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(story)
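# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())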

['about', 'ago', 'agree', 'all', 'always', 'an', 'and', 'another', 'applause', 'approach', 'approaches', 'at', 'attached', 'be', 'bell', 'but', 'by', 'case', 'cat', 'chief', 'common', 'consider', 'consists', 'could', 'council', 'danger', 'easily', 'easy', 'enemy', 'escape', 'from', 'general', 'got', 'had', 'he', 'her', 'if', 'impossible', 'in', 'is', 'it', 'know', 'last', 'long', 'looked', 'make', 'manner', 'means', 'measures', 'meet', 'met', 'mice', 'mouse', 'neck', 'neighbourhood', 'nobody', 'now', 'of', 'old', 'one', 'our', 'outwit', 'procured', 'proposal', 'propose', 'receive', 'remedies', 'retire', 'ribbon', 'round', 'said', 'she', 'should', 'signal', 'sly', 'small', 'some', 'spoke', 'take', 'that', 'the', 'their', 'then', 'therefore', 'they', 'this', 'thought', 'to', 'treacherous', 'until', 'up', 'us', 'venture', 'very', 'was', 'we', 'well', 'what', 'when', 'which', 'while', 'who', 'will', 'with', 'would', 'you', 'young']


#### As you can see, TF-IDF does not separate important words from irrelevant ones here: the vocabulary still contains words such as "the", "and" and "to". This is primarily because there is not enough similar text to derive the inverse document frequency, so in such cases we go with a plain frequency count and sentence ranking.
##### A separate script is provided for this
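
To make the point about the inverse document frequency concrete, here is a minimal sketch of the smoothed IDF formula that scikit-learn documents for `TfidfVectorizer` with its default settings. The helper name `smoothed_idf` and the 100-document corpus are hypothetical, chosen only for illustration.

```python
import math

def smoothed_idf(n_documents, document_frequency):
    # Smoothed IDF as documented for scikit-learn's default settings:
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

# With a single document, every term that occurs has n = 1 and df = 1,
# so all terms receive the same IDF and only the raw counts differ.
single_doc_idf = smoothed_idf(1, 1)
print(single_doc_idf)

# In a hypothetical 100-document corpus, a rare term outweighs a common one.
rare_idf = smoothed_idf(100, 2)
common_idf = smoothed_idf(100, 95)
print(rare_idf > common_idf)
```

With one story, the TF-IDF weights are therefore just normalized term counts, which is why frequent function words are not pushed down.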

##### We will limit this to sentence ranking; with the Microsoft Bot Framework, this can be put to many other use cases.
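
The separate script is not reproduced in this notebook. As a rough sketch of what frequency-based sentence ranking can look like (not the author's script), the example below scores each sentence by the average corpus frequency of its non-stop words. The function name `rank_sentences`, the tiny stop-word set, and the shortened sample text are all assumptions made for illustration; `nltk`'s stop-word corpus could be substituted.

```python
import re
from collections import Counter

# Small illustrative stop-word set (an assumption, not nltk's full list).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "had", "he", "she",
              "it", "is", "was", "be", "that", "this", "said", "should",
              "we", "could", "at", "by", "but"}

def content_words(text):
    # Lower-case alphabetic tokens with stop words removed.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def rank_sentences(text, top_n=2):
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s.strip() for s in re.split(r"[.?!]", text) if s.strip()]
    freq = Counter(content_words(text))
    scored = []
    for sentence in sentences:
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        score = sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0
        scored.append((score, sentence))
    # Highest score first; sorted() is stable, so ties keep document order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_n]

# Shortened, paraphrased sample in the same space-tokenized style as the story.
sample = ("the mice had a council to outwit the cat . "
          "a young mouse said a bell should be attached to the cat . "
          "nobody spoke .")
ranked = rank_sentences(sample, top_n=2)
for score, sentence in ranked:
    print(round(score, 2), sentence)
```

The top-ranked sentences form the extractive summary. Averaging rather than summing is a design choice; summing would favour longer sentences.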

## Entity recognition - Identifying policy number from text document

In [7]:
# on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0
policy1=pd.read_table("1_pol.txt", on_bad_lines="skip")

In [8]:
policy1

Unnamed: 0,AAAAAAAAAA
0,Policy Information:
1,Policy Holder Details Contact Us
2,"TAILORED TRAINING PROGRAMS, LLC Business Service Center"
3,Business Hours: Monday - Friday
4,Policy Number Policy Term
5,99 XYX JK123 08/19/18 to 08/19/19
6,Producer's Name: ABC SERVICES
7,Producer's Code: 22270548
8,Producer's Fact Sheet
9,Account Details:


#### Trial: lexical analysis using tokenization

In [9]:
wordtokens=policy1["AAAAAAAAAA"].apply(lambda x:nltk.word_tokenize(x))

In [10]:
sentencetokens=policy1["AAAAAAAAAA"].apply(lambda x:nltk.sent_tokenize(x))

In [11]:
sentencetokens

0                                                                           [Policy Information:]
1                        [Policy Holder Details                                       Contact Us]
2                    [TAILORED TRAINING PROGRAMS, LLC                    Business Service Center]
3            [                                                   Business Hours: Monday - Friday]
4                                                          [Policy Number            Policy Term]
5                                                  [99 XYX JK123            08/19/18 to 08/19/19]
6                                                               [Producer's Name:   ABC SERVICES]
7                                                                   [Producer's Code:   22270548]
8                                                [                         Producer's Fact Sheet]
9                                                                              [Account Details:]
10                  

#### Part-of-speech tagging

In [12]:
postags=sentencetokens.apply(lambda x:nltk.pos_tag(x,tagset='universal'))

In [13]:
postags

0                                                                           [(Policy Information:, NOUN)]
1                        [(Policy Holder Details                                       Contact Us, NOUN)]
2                    [(TAILORED TRAINING PROGRAMS, LLC                    Business Service Center, NOUN)]
3             [(                                                   Business Hours: Monday - Friday, ADJ)]
4                                                          [(Policy Number            Policy Term, NOUN)]
5                                                   [(99 XYX JK123            08/19/18 to 08/19/19, NUM)]
6                                                               [(Producer's Name:   ABC SERVICES, NOUN)]
7                                                                   [(Producer's Code:   22270548, NOUN)]
8                                                [(                         Producer's Fact Sheet, NOUN)]
9                                             

#### For text documents like this one, it is thus observed that regex matching is the better option: the tags above are assigned to whole lines and do not isolate the policy number, so this does not need a machine learning solution
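
As one illustration of how far plain pattern matching goes on this layout, the sketch below pairs a header row with the row beneath it by splitting on runs of whitespace. The two sample lines are copied from the tokenizer output above; the helper `split_columns` and the assumption that columns are separated by two or more spaces are hypothetical and would need checking against the real files.

```python
import re

# A header row and the value row beneath it, as they appear in the output above.
header_line = "Policy Number            Policy Term"
value_line = "99 XYX JK123            08/19/18 to 08/19/19"

def split_columns(line):
    # Treat a run of two or more whitespace characters as a column separator,
    # so single spaces inside a value ("99 XYX JK123") are preserved.
    return re.split(r"\s{2,}", line.strip())

# Pair each header with the value in the same column position.
fields = dict(zip(split_columns(header_line), split_columns(value_line)))
print(fields["Policy Number"])
```

This relies only on the visual layout of the document, which is why no trained model is needed for it.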

In [14]:
import re

In [20]:
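def findpolicy(policy):
    # Policy number shape: two digits, three capital letters, five alphanumerics
    pattern = re.compile(r'([0-9]{2}) ([A-Z]{3}) ([0-9A-Z]{5})')
    for _, row in policy.iterrows():
        # str(row) renders the whole row as text, so every column is searched
        match = pattern.search(str(row))
        if match is not None:
            print(match.group())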

#### Encapsulating the search as a function makes it reusable on other policy documents

In [17]:
# on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0
policy2=pd.read_table("2_pol.txt", on_bad_lines="skip")

In [19]:
findpolicy(policy2)

99 XYX JK123
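
`findpolicy` prints what it finds. For reuse in a larger pipeline it may be more convenient to return the matches instead. The sketch below is one possible variant, not part of the original notebook: the name `extract_policy_numbers` is hypothetical, it works on plain text lines rather than a DataFrame, and the added `\b` word boundaries are an assumption about how strict the match should be.

```python
import re

# Same shape as the pattern in findpolicy, with \b word boundaries added
# (an assumption) so the pattern cannot match inside a longer token.
POLICY_PATTERN = re.compile(r"\b[0-9]{2} [A-Z]{3} [0-9A-Z]{5}\b")

def extract_policy_numbers(lines):
    """Return every policy-number-shaped substring found in the given lines."""
    found = []
    for line in lines:
        # The pattern has no capture groups, so findall returns whole matches.
        found.extend(POLICY_PATTERN.findall(line))
    return found

# Sample lines copied from the tokenizer output earlier in the notebook.
sample_lines = [
    "Policy Number            Policy Term",
    "99 XYX JK123            08/19/18 to 08/19/19",
    "Producer's Code:   22270548",
]
policy_numbers = extract_policy_numbers(sample_lines)
print(policy_numbers)
```

To apply it to a DataFrame loaded as above, the text column can be passed in directly, for example `extract_policy_numbers(policy2.iloc[:, 0].astype(str))`, assuming the text sits in the first column.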
