# Yu-Ting Shen

# RiskGenius Challenge Project


https://www.irmi.com/glossary

https://scrapy.org/

The IRMI link points to a glossary of insurance term definitions.
The Scrapy link points to a library that can extract data from websites.

The project has three parts:

1. Scrape and store the IRMI glossary in some data format (for example SQLite or JSON).  Be sure to have at least the definition label and definition text.  Other data might be unnecessary.

2. Build a classifier (you can choose the model) and optimize hyperparameters to predict the definition label from the definition text.

3. Predict the words that will appear in the definition label, instead of the label itself.  One option in this case is to predict the count vector of the definition label.

This could have a real application in RiskGenius, as a step toward automatically generating definition labels by predicting the words that would be used in them.  In many cases, you are likely to find that words in the definition label do not appear in the definition text, so keep that in mind.
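To get a feel for how often label words are missing from the definition text, one could measure the overlap directly. The following is a minimal sketch, not part of the original challenge: the helper names `tokenize` and `label_word_coverage` are hypothetical, the regex tokenizer is a simplification, and the two example pairs are made up for illustration.

```python
import re

def tokenize(s):
    """Lowercase and split on runs of letters (a simplification, not the NLTK tokenizer)."""
    return re.findall(r"[a-z]+", s.lower())

def label_word_coverage(label, text):
    """Return the fraction of distinct label words that also occur in the text."""
    label_words = set(tokenize(label))
    if not label_words:
        return 0.0
    text_words = set(tokenize(text))
    return len(label_words & text_words) / len(label_words)

# Two made-up (label, text) pairs for illustration; not taken from terms.csv.
examples = [
    ("stop loss", "A form of reinsurance that caps the total loss of the insurer"),
    ("soft market", "One side of the cycle characterized by low rates"),
]
for label, text in examples:
    print(label, "->", label_word_coverage(label, text))
```

Averaging this fraction over all rows would give a rough upper bound on how many label words a purely extractive approach could recover.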

***
***
***

## Load data

In [5]:
import pandas as pd

df_insurance_terms = pd.read_csv('terms.csv')
df_insurance_terms.head()

Unnamed: 0,term,text,synonym
0,automatic premium loan,An optional provision in life insurance that a...,
1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...,
2,hydrocarbons,A class of organic compounds composed only of ...,
3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...,
4,hybrid plans,Risk financing techniques that are a combinati...,


In [6]:
df_insurance_terms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3261 entries, 0 to 3260
Data columns (total 3 columns):
term       3261 non-null object
text       3261 non-null object
synonym    18 non-null object
dtypes: object(3)
memory usage: 76.5+ KB


Most of the **synonym** values are **NaN**, but 18 records are not NaN.

List these 18 records.

In [7]:
df_insurance_terms[df_insurance_terms['synonym'].notnull()]

Unnamed: 0,term,text,synonym
215,cost of hire endorsement,A contractors equipment coverage endorsement t...,Rental cost reimbursement endorsement
273,product disparagement,A standard peril covered under a media profess...,Trade libel
314,primary insurer,A transaction in which one party the reinsur...,Reinsurance
323,preservation of property,An ocean and inland marine insurance provision...,Sue and labor clause
578,excess of loss ratio reinsurance,A form of reinsurance also known as aggregate...,Stop loss
584,fronted captive,A special-purpose insurer that operates only o...,Reinsurance captive
593,policy reserve,That portion of the policy premium that has no...,Unearned premium
769,interrelated claims provisions,Provisions within professional liability insur...,Related claims provisions
796,buyers market,One side of the market cycle that is character...,Soft market
866,inter-insurance exchange,An unincorporated group of individuals or orga...,Reciprocal company


## Convert into SQL

* Using SQLite

In [8]:
from sqlalchemy import create_engine

engine = create_engine('sqlite:///insurance_terms.sqlite', echo=False)
df_insurance_terms.to_sql('insurance_terms', con=engine)

* Load the SQL file to check

In [9]:
engine2 = create_engine('sqlite:///insurance_terms.sqlite')
table_names = engine2.table_names()
print(table_names)

sql_command = 'SELECT * FROM ' + table_names[0]
print(sql_command)

con = engine2.connect()
rs = con.execute(sql_command)
df_test_sql = pd.DataFrame(rs.fetchall())
df_test_sql.head()

['insurance_terms']
SELECT * FROM insurance_terms


Unnamed: 0,0,1,2,3
0,0,automatic premium loan,An optional provision in life insurance that a...,
1,1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...,
2,2,hydrocarbons,A class of organic compounds composed only of ...,
3,3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...,
4,4,hybrid plans,Risk financing techniques that are a combinati...,


In [10]:
df_test_sql.columns=['index', 'term', 'text', 'synonym']
df_test_sql_2 = df_test_sql.drop(['index'], axis=1)
df_test_sql_2.head()

Unnamed: 0,term,text,synonym
0,automatic premium loan,An optional provision in life insurance that a...,
1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...,
2,hydrocarbons,A class of organic compounds composed only of ...,
3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...,
4,hybrid plans,Risk financing techniques that are a combinati...,


In [11]:
# df_test_sql.drop(['index'], axis=1).equals(df_insurance_terms)
df_test_sql_2.equals(df_insurance_terms)

True
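The manual column renaming above is needed because `fetchall()` returns bare tuples. As an alternative, pandas can read the table back with its column names intact. This is a minimal sketch under assumptions: it uses an in-memory SQLite database and a toy frame standing in for `df_insurance_terms`, and it writes with `index=False` so no index column is stored.

```python
import sqlite3
import pandas as pd

# Toy frame standing in for df_insurance_terms (values are illustrative).
df = pd.DataFrame({
    "term": ["hybrid plans", "buyers market"],
    "text": ["Risk financing techniques", "One side of the market cycle"],
    "synonym": [None, "Soft market"],
})

con = sqlite3.connect(":memory:")                 # in-memory database, nothing written to disk
df.to_sql("insurance_terms", con, index=False)    # index=False: do not store the row index
df_back = pd.read_sql_query("SELECT * FROM insurance_terms", con)
con.close()

print(list(df_back.columns))
```

With this approach there is no `index` column to drop and no need to assign column names by hand.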

## Convert into JSON format

* From CSV to JSON

In [32]:
import json
import numpy as np

# Write one JSON object per line
with open('terms.csv', 'r') as fin, open('insurance_terms.json', 'w') as fout:
    for line in fin:
        # A plain split is used here; it assumes no field contains a comma
        split_line = line.rstrip().split(',')
        if len(split_line) != 3:
            continue

        d = {}
        d['term'] = split_line[0]
        d['text'] = split_line[1]
        d['synonym'] = split_line[2]

        # Skip the header row
        if d['term'] == 'term':
            continue

        fout.write(json.dumps(d))
        fout.write('\n')

* Load JSON to check

In [33]:
df_test_json = pd.read_json('insurance_terms.json', orient='columns', lines=True)
df_test_json.head()

Unnamed: 0,synonym,term,text
0,,automatic premium loan,An optional provision in life insurance that a...
1,,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...
2,,hydrocarbons,A class of organic compounds composed only of ...
3,,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...
4,,hybrid plans,Risk financing techniques that are a combinati...


In [34]:
len(df_test_json)

3261

In [35]:
cols = ['term', 'text', 'synonym']
df_test_json2 = df_test_json[cols].reset_index(drop=True)
df_test_json2.head()

Unnamed: 0,term,text,synonym
0,automatic premium loan,An optional provision in life insurance that a...,
1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...,
2,hydrocarbons,A class of organic compounds composed only of ...,
3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...,
4,hybrid plans,Risk financing techniques that are a combinati...,


In [36]:
len(df_test_json2)

3261

In [37]:
df_test_json2.equals(df_insurance_terms)

False

The result above is False because a missing synonym is NaN in `df_insurance_terms` but an empty string in `df_test_json2`. Let's compare the term and text columns only.
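One way to avoid the empty-string mismatch is to write missing synonyms as JSON `null`. The sketch below is an illustration rather than the notebook's method: it swaps the manual `split(',')` for the standard `csv` module (which also copes with quoted fields that contain commas) and uses an in-memory stand-in for `terms.csv` with made-up rows.

```python
import csv
import io
import json

# In-memory stand-in for terms.csv (header plus two illustrative rows).
csv_data = io.StringIO(
    "term,text,synonym\n"
    'hybrid plans,"Risk financing, combining retention and transfer",\n'
    "buyers market,One side of the market cycle,Soft market\n"
)

json_lines = []
for row in csv.DictReader(csv_data):      # the header row is consumed automatically
    # Map an empty synonym to None so that it is written as JSON null.
    row["synonym"] = row["synonym"] or None
    json_lines.append(json.dumps(row))

print("\n".join(json_lines))
```

A JSON `null` should then be read back by pandas as a missing value instead of an empty string.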

In [38]:
df_test_json2.head()

Unnamed: 0,term,text,synonym
0,automatic premium loan,An optional provision in life insurance that a...,
1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...,
2,hydrocarbons,A class of organic compounds composed only of ...,
3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...,
4,hybrid plans,Risk financing techniques that are a combinati...,


In [39]:
df_insurance_terms.head()

Unnamed: 0,term,text,synonym
0,automatic premium loan,An optional provision in life insurance that a...,
1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...,
2,hydrocarbons,A class of organic compounds composed only of ...,
3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...,
4,hybrid plans,Risk financing techniques that are a combinati...,


In [40]:
df_test_json3 = df_test_json2[['term', 'text']]
df_test_json3.equals(df_insurance_terms[['term', 'text']])

True

In [43]:
df_test_json3.head()

Unnamed: 0,term,text
0,automatic premium loan,An optional provision in life insurance that a...
1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...
2,hydrocarbons,A class of organic compounds composed only of ...
3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...
4,hybrid plans,Risk financing techniques that are a combinati...


In [42]:
len(df_test_json3), len(df_insurance_terms)

(3261, 3261)

* From pandas dataframe to JSON

In [44]:
df_insurance_terms.to_json(path_or_buf='insurance_terms_2.json', orient='index')

In [45]:
df_test_json4 = pd.read_json('insurance_terms_2.json', orient='index')
df_test_json4.head()

Unnamed: 0,synonym,term,text
0,,automatic premium loan,An optional provision in life insurance that a...
1,,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...
2,,hydrocarbons,A class of organic compounds composed only of ...
3,,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...
4,,hybrid plans,Risk financing techniques that are a combinati...


In [46]:
df_test_json4 = df_test_json4[cols]
df_test_json4.head()

Unnamed: 0,term,text,synonym
0,automatic premium loan,An optional provision in life insurance that a...,
1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...,
2,hydrocarbons,A class of organic compounds composed only of ...,
3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...,
4,hybrid plans,Risk financing techniques that are a combinati...,


In [47]:
df_test_json4.equals(df_insurance_terms)

True

## Only keep term and text for the analysis

In [48]:
df_insurance_terms_2 = df_insurance_terms[['term', 'text']]
df_insurance_terms_2.head()

Unnamed: 0,term,text
0,automatic premium loan,An optional provision in life insurance that a...
1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...
2,hydrocarbons,A class of organic compounds composed only of ...
3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...
4,hybrid plans,Risk financing techniques that are a combinati...


In [49]:
def dataframe_to_sql(df, table_name):
    from sqlalchemy import create_engine

    sqlite = 'sqlite:///' + table_name + '.sqlite'
    engine = create_engine(sqlite, echo=False)
    df.to_sql(table_name, con=engine)

In [50]:
dataframe_to_sql(df_insurance_terms_2, 'insurance_terms_2')

In [51]:
def csv_to_json(csv_file, json_file):
    """Write the term and text columns of csv_file as one JSON object per line."""
    import json

    with open(csv_file, 'r') as fin, open(json_file, 'w') as fout:
        for line in fin:
            split_line = line.rstrip().split(',')
            if len(split_line) != 3:
                continue
            # Skip the header row
            if split_line[0] == 'term':
                continue
            d = {'term': split_line[0], 'text': split_line[1]}
            fout.write(json.dumps(d))
            fout.write('\n')

In [52]:
csv_to_json('terms.csv', 'insurance_terms_3.json')

In [53]:
def dataframe_to_json(df, json_file):
    df.to_json(path_or_buf=json_file, orient='index')

In [54]:
dataframe_to_json(df_insurance_terms_2, 'insurance_terms_4.json')

***
***
***

### This part is just for testing

In [123]:
# divide terms by the starting character
df_term_a = df_insurance_terms_2[df_insurance_terms_2['term'].str.startswith('a')]

In [124]:
df_term_a.head()

Unnamed: 0,term,text
0,automatic premium loan,An optional provision in life insurance that a...
2249,avoidance,A risk management technique whereby risk of lo...
2250,average weekly wage (AWW),An employee s pre-injury earning capacity bas...
2251,aviation hazard,A hazard associated with the peril of death or...
2252,aviation accident insurance,Protects employees of a company or other insur...


In [125]:
from nltk import word_tokenize
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

def normalize_string(string):
    string = string.lower()
    tokenized_words = word_tokenize(string)
    stopwords_removed = [word for word in tokenized_words if word not in stop_words]
    alphabet_string = [word for word in stopwords_removed if word.isalpha()]
    return alphabet_string
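Note that the `isalpha()` filter in `normalize_string` also discards hyphenated tokens such as `pre-injury`, because a hyphen is not alphabetic. If those words should be kept, a regex tokenizer is one option. This is a minimal sketch with assumptions: `normalize_keep_hyphens` is a hypothetical name, the regex replaces NLTK's `word_tokenize`, and `STOP_WORDS` is a tiny illustrative set rather than NLTK's English list.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's full English list.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "s"}

def normalize_keep_hyphens(string):
    """Lowercase, tokenize with a regex, drop stop words, keep hyphenated words."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", string.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

print("pre-injury".isalpha())  # False, which is why isalpha() drops the token
print(normalize_keep_hyphens("An employee s pre-injury earning capacity"))
```

Whether hyphenated words help the classifier is an empirical question; the point here is only that the filter choice changes the vocabulary.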

In [126]:
df_term_a.loc[:, 'term'] = df_term_a.apply(lambda row: normalize_string(row['term']), axis=1)
df_term_a.loc[:, 'text'] = df_term_a.apply(lambda row: normalize_string(row['text']), axis=1)

df_term_a.head()

Unnamed: 0,term,text
0,"[automatic, premium, loan]","[optional, provision, life, insurance, authori..."
2249,[avoidance],"[risk, management, technique, whereby, risk, l..."
2250,"[average, weekly, wage, aww]","[employee, earning, capacity, based, earnings,..."
2251,"[aviation, hazard]","[hazard, associated, peril, death, disability,..."
2252,"[aviation, accident, insurance]","[protects, employees, company, insured, person..."


In [98]:
# lower case
# df_term_a['term'] = df_term_a['term'].str.lower()
# df_term_a['text'] = df_term_a['text'].str.lower()
df_term_a.loc[:,'term'] = df_term_a['term'].str.lower()
df_term_a.loc[:,'text'] = df_term_a['text'].str.lower()

In [99]:
df_term_a.head()

Unnamed: 0,term,text
0,automatic premium loan,an optional provision in life insurance that a...
2249,avoidance,a risk management technique whereby risk of lo...
2250,average weekly wage (aww),an employee s pre-injury earning capacity bas...
2251,aviation hazard,a hazard associated with the peril of death or...
2252,aviation accident insurance,protects employees of a company or other insur...


In [101]:
# Tokenize
from nltk import word_tokenize
# df_term_a['term_tokenized'] = df_term_a.apply(lambda row: word_tokenize(row['term']), axis=1)
# df_term_a['text_tokenized'] = df_term_a.apply(lambda row: word_tokenize(row['text']), axis=1)
df_term_a.loc[:, 'term_tokenized'] = df_term_a['term'].apply(word_tokenize)
df_term_a.loc[:, 'text_tokenized'] = df_term_a['text'].apply(word_tokenize)

In [102]:
df_term_a.head()

Unnamed: 0,term,text,term_tokenized,text_tokenized
0,automatic premium loan,an optional provision in life insurance that a...,"[automatic, premium, loan]","[an, optional, provision, in, life, insurance,..."
2249,avoidance,a risk management technique whereby risk of lo...,[avoidance],"[a, risk, management, technique, whereby, risk..."
2250,average weekly wage (aww),an employee s pre-injury earning capacity bas...,"[average, weekly, wage, (, aww, )]","[an, employee, s, pre-injury, earning, capacit..."
2251,aviation hazard,a hazard associated with the peril of death or...,"[aviation, hazard]","[a, hazard, associated, with, the, peril, of, ..."
2252,aviation accident insurance,protects employees of a company or other insur...,"[aviation, accident, insurance]","[protects, employees, of, a, company, or, othe..."


In [103]:
before_remove_stopwords = df_term_a.loc[0, 'text_tokenized']

In [104]:
# Remove stop words
from nltk.corpus import stopwords
stop = stopwords.words('english')

# df_term_a['term_tokenized'] = df_term_a['term_tokenized'].apply(lambda x: [item for item in x if item not in stop])
# df_term_a['text_tokenized'] = df_term_a['text_tokenized'].apply(lambda x: [item for item in x if item not in stop])
df_term_a.loc[:,'term_tokenized'] = df_term_a['term_tokenized'].apply(lambda x: [item for item in x if item not in stop])
df_term_a.loc[:,'text_tokenized'] = df_term_a['text_tokenized'].apply(lambda x: [item for item in x if item not in stop])


In [105]:
df_term_a.head()

Unnamed: 0,term,text,term_tokenized,text_tokenized
0,automatic premium loan,an optional provision in life insurance that a...,"[automatic, premium, loan]","[optional, provision, life, insurance, authori..."
2249,avoidance,a risk management technique whereby risk of lo...,[avoidance],"[risk, management, technique, whereby, risk, l..."
2250,average weekly wage (aww),an employee s pre-injury earning capacity bas...,"[average, weekly, wage, (, aww, )]","[employee, pre-injury, earning, capacity, base..."
2251,aviation hazard,a hazard associated with the peril of death or...,"[aviation, hazard]","[hazard, associated, peril, death, disability,..."
2252,aviation accident insurance,protects employees of a company or other insur...,"[aviation, accident, insurance]","[protects, employees, company, insured, person..."


In [106]:
after_remove_stopwords = df_term_a.loc[0, 'text_tokenized']

In [107]:
before_remove_stopwords

['an',
 'optional',
 'provision',
 'in',
 'life',
 'insurance',
 'that',
 'authorizes',
 'the',
 'insurer',
 'to',
 'pay',
 'from',
 'the',
 'cash',
 'value',
 'any',
 'premium',
 'due',
 'at',
 'the',
 'end',
 'of',
 'the',
 'grace',
 'period',
 'this',
 'provision',
 'is',
 'useful',
 'in',
 'preventing',
 'inadvertent',
 'lapse',
 'of',
 'the',
 'policy']

In [108]:
after_remove_stopwords

['optional',
 'provision',
 'life',
 'insurance',
 'authorizes',
 'insurer',
 'pay',
 'cash',
 'value',
 'premium',
 'due',
 'end',
 'grace',
 'period',
 'provision',
 'useful',
 'preventing',
 'inadvertent',
 'lapse',
 'policy']

In [109]:
for item in before_remove_stopwords:
    if item not in after_remove_stopwords:
        print(item)

an
in
that
the
to
from
the
any
at
the
of
the
this
is
in
of
the


In [110]:
# keep only alphabet
# df_term_a['term_tokenized'] = df_term_a['term_tokenized'].apply(lambda x: [item for item in x if item.isalpha()])
# df_term_a['text_tokenized'] = df_term_a['text_tokenized'].apply(lambda x: [item for item in x if item.isalpha()])
df_term_a.loc[:,'term_tokenized'] = df_term_a['term_tokenized'].apply(lambda x: [item for item in x if item.isalpha()])
df_term_a.loc[:,'text_tokenized'] = df_term_a['text_tokenized'].apply(lambda x: [item for item in x if item.isalpha()])

In [111]:
df_term_a.head()

Unnamed: 0,term,text,term_tokenized,text_tokenized
0,automatic premium loan,an optional provision in life insurance that a...,"[automatic, premium, loan]","[optional, provision, life, insurance, authori..."
2249,avoidance,a risk management technique whereby risk of lo...,[avoidance],"[risk, management, technique, whereby, risk, l..."
2250,average weekly wage (aww),an employee s pre-injury earning capacity bas...,"[average, weekly, wage, aww]","[employee, earning, capacity, based, earnings,..."
2251,aviation hazard,a hazard associated with the peril of death or...,"[aviation, hazard]","[hazard, associated, peril, death, disability,..."
2252,aviation accident insurance,protects employees of a company or other insur...,"[aviation, accident, insurance]","[protects, employees, company, insured, person..."


In [121]:
# documents holds the definition texts and labels holds the terms
X = documents.values
y = labels.values

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

vec_trained = vectorizer.fit_transform(X_train)

vec_trained = vec_trained.toarray()

words = vectorizer.get_feature_names()

vec_test = vectorizer.transform(X_test)

vec_test = vec_test.toarray()

In [122]:
vec_trained.shape

(2282, 5000)
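The pattern used above, fitting the vectorizer on the training texts and only transforming the test texts, can be checked on a toy corpus. The sketch below uses made-up documents and a small `max_features`. It also guards the feature-name call, since newer scikit-learn releases provide `get_feature_names_out()` in place of `get_feature_names()`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration only.
train_docs = ["risk of loss to the insurer", "loss of premium", "premium due to insurer"]
test_docs = ["unseen hazard and loss"]

vectorizer = TfidfVectorizer(stop_words="english", max_features=3)
X_train = vectorizer.fit_transform(train_docs)   # learns the vocabulary and IDF weights
X_test = vectorizer.transform(test_docs)         # reuses them; unseen words are ignored

# Use whichever feature-name method this scikit-learn version provides.
if hasattr(vectorizer, "get_feature_names_out"):
    words = list(vectorizer.get_feature_names_out())
else:
    words = vectorizer.get_feature_names()

print(X_train.shape, X_test.shape, words)
```

The test matrix has the same number of columns as the training matrix, so a model trained on one can score the other.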

In [123]:
words

['00',
 '000',
 '01',
 '02',
 '03',
 '04',
 '05',
 '06',
 '08',
 '10',
 '100',
 '11',
 '110',
 '113',
 '114',
 '12',
 '125',
 '13',
 '15',
 '159',
 '17',
 '18',
 '1913',
 '1950s',
 '1964',
 '1970',
 '1973',
 '1974',
 '1980s',
 '1981',
 '1986',
 '1988',
 '1989',
 '1990',
 '1991',
 '1995',
 '1998',
 '1999',
 '20',
 '2000',
 '2000s',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2012',
 '2013',
 '2014',
 '2015',
 '2020',
 '21',
 '23',
 '24',
 '25',
 '250',
 '26',
 '28',
 '30',
 '300',
 '31',
 '32',
 '35',
 '40',
 '401',
 '41',
 '45',
 '48',
 '50',
 '500',
 '501',
 '510',
 '53',
 '54',
 '55',
 '60',
 '63',
 '65',
 '70',
 '72',
 '73',
 '831',
 '90',
 '96',
 '99',
 'a201',
 'aais',
 'abandoned',
 'abatement',
 'abbreviated',
 'abc',
 'abilities',
 'ability',
 'able',
 'absence',
 'absent',
 'absolute',
 'absolve',
 'abuse',
 'abuses',
 'acc',
 'accept',
 'acceptability',
 'acceptable',
 'acceptance',
 'accepted',
 'accepting',
 'accepts',
 'acces

In [None]:
vec_test = vectorizer.transform(X_test)

Because there are 3261 terms corresponding to 3261 labels, each label has only one example, so I first group the labels into clusters. I also want to know how many unique words appear in the documents and in the labels, so that I can decide the vocabulary size.

In [82]:
docs = []
for index, text in documents.items():  # Series.iteritems() was removed in pandas 2.0
    docs.append(text)

In [83]:
docs

['An optional provision in life insurance that authorizes the insurer to pay from the cash value any premium due at the end of the grace period  This provision is useful in preventing inadvertent lapse of the policy ',
 'Provided a nonjudicial dispute settlement program for the carriage of household goods ',
 'A class of organic compounds composed only of carbon and hydrogen  Common hydrocarbons include natural gas  crude oil  and coal  Hydrocarbons are the primary source of the world s electric energy and heat sources due to the power created when they are burned ',
 'A process in which fractures in hard-to-reach shale rock formations below the earth s surface are opened and widened by injecting water  chemicals  and sand under high pressure to extract natural gas and oil trapped in the shale rocks  It enables wells that would flow only at very low rates to produce oil and gas in commercially viable volumes ',
 'Risk financing techniques that are a combination of retention and transfe

In [84]:
from nltk.tokenize import word_tokenize

tokenized = [word_tokenize(text.lower()) for text in docs]
tokenized

[['an',
  'optional',
  'provision',
  'in',
  'life',
  'insurance',
  'that',
  'authorizes',
  'the',
  'insurer',
  'to',
  'pay',
  'from',
  'the',
  'cash',
  'value',
  'any',
  'premium',
  'due',
  'at',
  'the',
  'end',
  'of',
  'the',
  'grace',
  'period',
  'this',
  'provision',
  'is',
  'useful',
  'in',
  'preventing',
  'inadvertent',
  'lapse',
  'of',
  'the',
  'policy'],
 ['provided',
  'a',
  'nonjudicial',
  'dispute',
  'settlement',
  'program',
  'for',
  'the',
  'carriage',
  'of',
  'household',
  'goods'],
 ['a',
  'class',
  'of',
  'organic',
  'compounds',
  'composed',
  'only',
  'of',
  'carbon',
  'and',
  'hydrogen',
  'common',
  'hydrocarbons',
  'include',
  'natural',
  'gas',
  'crude',
  'oil',
  'and',
  'coal',
  'hydrocarbons',
  'are',
  'the',
  'primary',
  'source',
  'of',
  'the',
  'world',
  's',
  'electric',
  'energy',
  'and',
  'heat',
  'sources',
  'due',
  'to',
  'the',
  'power',
  'created',
  'when',
  'they',
  'ar

In [85]:
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))

docs = [[word for word in words if  word not in stop] for words in tokenized] # words is one tokenized document (a list); word is a single word in that list
docs

[['optional',
  'provision',
  'life',
  'insurance',
  'authorizes',
  'insurer',
  'pay',
  'cash',
  'value',
  'premium',
  'due',
  'end',
  'grace',
  'period',
  'provision',
  'useful',
  'preventing',
  'inadvertent',
  'lapse',
  'policy'],
 ['provided',
  'nonjudicial',
  'dispute',
  'settlement',
  'program',
  'carriage',
  'household',
  'goods'],
 ['class',
  'organic',
  'compounds',
  'composed',
  'carbon',
  'hydrogen',
  'common',
  'hydrocarbons',
  'include',
  'natural',
  'gas',
  'crude',
  'oil',
  'coal',
  'hydrocarbons',
  'primary',
  'source',
  'world',
  'electric',
  'energy',
  'heat',
  'sources',
  'due',
  'power',
  'created',
  'burned'],
 ['process',
  'fractures',
  'hard-to-reach',
  'shale',
  'rock',
  'formations',
  'earth',
  'surface',
  'opened',
  'widened',
  'injecting',
  'water',
  'chemicals',
  'sand',
  'high',
  'pressure',
  'extract',
  'natural',
  'gas',
  'oil',
  'trapped',
  'shale',
  'rocks',
  'enables',
  'wells',
 

In [86]:
# docs[0]

vocab = [word for words in docs for word in words]
vocab

['optional',
 'provision',
 'life',
 'insurance',
 'authorizes',
 'insurer',
 'pay',
 'cash',
 'value',
 'premium',
 'due',
 'end',
 'grace',
 'period',
 'provision',
 'useful',
 'preventing',
 'inadvertent',
 'lapse',
 'policy',
 'provided',
 'nonjudicial',
 'dispute',
 'settlement',
 'program',
 'carriage',
 'household',
 'goods',
 'class',
 'organic',
 'compounds',
 'composed',
 'carbon',
 'hydrogen',
 'common',
 'hydrocarbons',
 'include',
 'natural',
 'gas',
 'crude',
 'oil',
 'coal',
 'hydrocarbons',
 'primary',
 'source',
 'world',
 'electric',
 'energy',
 'heat',
 'sources',
 'due',
 'power',
 'created',
 'burned',
 'process',
 'fractures',
 'hard-to-reach',
 'shale',
 'rock',
 'formations',
 'earth',
 'surface',
 'opened',
 'widened',
 'injecting',
 'water',
 'chemicals',
 'sand',
 'high',
 'pressure',
 'extract',
 'natural',
 'gas',
 'oil',
 'trapped',
 'shale',
 'rocks',
 'enables',
 'wells',
 'would',
 'flow',
 'low',
 'rates',
 'produce',
 'oil',
 'gas',
 'commercially',
 

In [87]:
from nltk.stem.wordnet import WordNetLemmatizer
wordnet = WordNetLemmatizer()
vocab = [wordnet.lemmatize(word) for word in vocab]
vocab

['optional',
 'provision',
 'life',
 'insurance',
 'authorizes',
 'insurer',
 'pay',
 'cash',
 'value',
 'premium',
 'due',
 'end',
 'grace',
 'period',
 'provision',
 'useful',
 'preventing',
 'inadvertent',
 'lapse',
 'policy',
 'provided',
 'nonjudicial',
 'dispute',
 'settlement',
 'program',
 'carriage',
 'household',
 'good',
 'class',
 'organic',
 'compound',
 'composed',
 'carbon',
 'hydrogen',
 'common',
 'hydrocarbon',
 'include',
 'natural',
 'gas',
 'crude',
 'oil',
 'coal',
 'hydrocarbon',
 'primary',
 'source',
 'world',
 'electric',
 'energy',
 'heat',
 'source',
 'due',
 'power',
 'created',
 'burned',
 'process',
 'fracture',
 'hard-to-reach',
 'shale',
 'rock',
 'formation',
 'earth',
 'surface',
 'opened',
 'widened',
 'injecting',
 'water',
 'chemical',
 'sand',
 'high',
 'pressure',
 'extract',
 'natural',
 'gas',
 'oil',
 'trapped',
 'shale',
 'rock',
 'enables',
 'well',
 'would',
 'flow',
 'low',
 'rate',
 'produce',
 'oil',
 'gas',
 'commercially',
 'viable',
 

In [88]:
print(len(vocab))

vocab = set(vocab)
print(len(vocab))

107304
8067


There are 8067 unique words in the documents.

Wrap the steps above into a function.

In [100]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

def get_vocabulary(docs):
    tokenized = [word_tokenize(text.lower()) for text in docs]
    stop = set(stopwords.words('english'))
    docs = [[word for word in words if word not in stop] for words in tokenized]
    vocab = [word for words in docs for word in words]

    wordnet = WordNetLemmatizer()
    vocab = [wordnet.lemmatize(word) for word in vocab]
    vocab = set(vocab)
    return vocab

In [101]:
terms = []
texts = []

for index, row in df_insurance_terms_2.iterrows():
    # Access by column name; positional row[0] / row[1] is deprecated for labeled Series
    terms.append(row['term'])
    texts.append(row['text'])
    
vocab_terms = get_vocabulary(terms)
vocab_texts = get_vocabulary(texts)

In [103]:
print(len(vocab_terms), len(vocab_texts))

2873 8067


So there are 2873 unique words in the terms and 8067 unique words in the texts.
For the analysis, I set the vocabulary size for terms and for texts to one third of these counts.
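Since `max_features` in `TfidfVectorizer` keeps the most frequent terms across the corpus, one way to sanity-check the one-third cutoff is to ask what share of all token occurrences the retained words cover. This is a minimal sketch: `coverage_of_top_k` is a hypothetical helper and the tokenized documents are toy data.

```python
from collections import Counter

def coverage_of_top_k(tokenized_docs, k):
    """Fraction of all token occurrences covered by the k most frequent tokens."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    total = sum(counts.values())
    covered = sum(n for _, n in counts.most_common(k))
    return covered / total if total else 0.0

# Toy tokenized documents for illustration only.
toy_docs = [
    ["risk", "loss", "insurer"],
    ["loss", "premium", "insurer"],
    ["loss", "policy"],
]
k = len({tok for doc in toy_docs for tok in doc}) // 3   # one third of the vocabulary
print(k, coverage_of_top_k(toy_docs, k))
```

If the coverage on the real data turned out to be low, a larger fraction of the vocabulary might be worth trying.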

In [104]:
max_size_terms = 2873//3
max_size_texts = 8067//3
print(max_size_terms, max_size_texts)

957 2689


In [105]:
X = documents.values
y = labels.values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=42)

In [111]:
y_train

array(['property damage (PD)', 'impaired risk', 'common carrier',
       'employees as insureds endorsement', 'graduated drivers licenses',
       'rectification coverage', 'risk sharing', 'maturity date',
       'delayed completion coverage',
       'fire department service charge coverage',
       'series business unit (SBU)',
       'alternative dispute resolution (ADR)', 'expiration',
       'increased limits table', 'contingent commission',
       'additional medical', 'zone rating', 'index-based contracts',
       'Pension Protection Act of 2006', 'setoff',
       'best demonstrated available technology (BDAT)', 'actuary',
       'risk concentration', 'net loss reserves',
       'continuation of coverage in bankruptcy provision',
       'legal action against insurer', 'tortfeasor', 'damages',
       'homeowners modified form 8 (HO 8)',
       'Quality Indicator Profile (QIP)', 'betterment clause',
       'contribution by equal shares', 'water quality-based limitations',
       'c

In [106]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_X = TfidfVectorizer(stop_words='english', max_features=max_size_texts)
vec_X_trained = vectorizer_X.fit_transform(X_train)

vectorizer_y = TfidfVectorizer(stop_words='english', max_features=max_size_terms)
vec_y_trained = vectorizer_y.fit_transform(y_train)
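For part 3 of the project, a predicted label vector eventually has to be turned back into words. scikit-learn vectorizers offer `inverse_transform` for this. The sketch below is an illustration under assumptions: it uses a binary `CountVectorizer` instead of the TF-IDF vectorizer above, and three toy labels.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy labels for illustration only.
labels_toy = ["automatic premium loan", "aviation hazard", "premium audit"]

label_vectorizer = CountVectorizer(binary=True)
Y = label_vectorizer.fit_transform(labels_toy)   # one row per label, one column per word

# inverse_transform maps each row back to the words with nonzero entries.
decoded = [sorted(words) for words in label_vectorizer.inverse_transform(Y)]
print(decoded)
```

Word order is lost in this representation, so the decoded words form a bag of candidate label words rather than a finished label.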

In [112]:
vocab_X = vectorizer_X.get_feature_names()
vocab_X

['00',
 '000',
 '05',
 '10',
 '100',
 '110',
 '12',
 '125',
 '15',
 '17',
 '18',
 '1980s',
 '1986',
 '1989',
 '1990',
 '1998',
 '20',
 '2000',
 '2008',
 '2012',
 '2014',
 '21',
 '24',
 '25',
 '250',
 '30',
 '32',
 '401',
 '41',
 '50',
 '500',
 '501',
 '54',
 '55',
 '60',
 '65',
 '72',
 '90',
 '99',
 'a201',
 'aais',
 'abandoned',
 'abc',
 'ability',
 'able',
 'absence',
 'absolute',
 'abuse',
 'abuses',
 'accept',
 'acceptability',
 'acceptable',
 'acceptance',
 'accepted',
 'accepting',
 'accepts',
 'access',
 'accessible',
 'accessing',
 'accident',
 'accidental',
 'accordance',
 'according',
 'accordingly',
 'account',
 'accountants',
 'accounting',
 'accounts',
 'accrued',
 'accurate',
 'acquired',
 'acquiring',
 'acquisition',
 'act',
 'action',
 'actions',
 'active',
 'activities',
 'activity',
 'acts',
 'actual',
 'actually',
 'actuarial',
 'acute',
 'acv',
 'ada',
 'add',
 'added',
 'adding',
 'addition',
 'additional',
 'additionally',
 'address',
 'adequate',
 'adjust',
 'adj

In [113]:
vocab_y = vectorizer_y.get_feature_names()
vocab_y

['114',
 '1386',
 '15',
 '1927',
 '1933',
 '1936',
 '1970',
 '1973',
 '1975',
 '1977',
 '1980',
 '1983',
 '1984',
 '1986',
 '1994',
 '1996',
 '1998',
 '2000',
 '2002',
 '2005',
 'abatement',
 'accident',
 'account',
 'accounting',
 'act',
 'action',
 'add',
 'added',
 'additional',
 'adequacy',
 'adjusted',
 'adjustment',
 'administration',
 'administrator',
 'admiralty',
 'admitted',
 'adr',
 'adverse',
 'affiliated',
 'agency',
 'agent',
 'agents',
 'aggregate',
 'aggregates',
 'agreed',
 'agreement',
 'agreements',
 'agricultural',
 'aiaf',
 'aim',
 'air',
 'aircraft',
 'airport',
 'ait',
 'aleatory',
 'alien',
 'alienated',
 'allowance',
 'alternative',
 'amendment',
 'american',
 'analysis',
 'analytic',
 'animal',
 'annuity',
 'antwerp',
 'api',
 'appeal',
 'appear',
 'appetite',
 'apportionment',
 'appraisers',
 'approach',
 'approval',
 'architects',
 'arm',
 'arrest',
 'asbestos',
 'assessment',
 'assessments',
 'associate',
 'association',
 'assumed',
 'auto',
 'automatic',
 

Use grid search to find the optimal k for KMeans.
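A caveat on searching over k: the default `KMeans.score` is the negative of the k-means objective, which tends to improve as k grows, so a plain grid search on it may simply favor the largest k. A common alternative, swapped in here as an assumption and not the notebook's method, is to compare silhouette scores in a simple loop. The sketch uses synthetic 2-D data with three well-separated groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with three well-separated groups (illustration only).
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = {}
for k in range(2, 6):
    cluster_ids = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, cluster_ids)   # higher means better-separated clusters

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On sparse TF-IDF label vectors the silhouette scores may be much flatter than on this toy data, so the result should be treated as a hint, not a definitive choice.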

In [114]:
vec_labels = vectorizer_y.transform(labels)

from sklearn.cluster import KMeans
n=8
kmeans = KMeans(n_clusters=n).fit(vec_trained)
print(kmeans.labels_.size)
kmeans.labels_

978


array([2, 7, 4, 3, 6, 5, 7, 4, 2, 2, 6, 6, 4, 6, 0, 1, 6, 7, 6, 0, 6, 0,
       2, 7, 2, 5, 6, 1, 2, 6, 2, 0, 6, 4, 6, 2, 1, 7, 6, 5, 2, 6, 4, 4,
       6, 6, 4, 2, 6, 2, 5, 2, 2, 2, 5, 5, 0, 5, 6, 1, 5, 7, 7, 6, 1, 1,
       2, 2, 6, 0, 6, 3, 7, 5, 7, 2, 6, 7, 2, 2, 1, 3, 2, 1, 5, 6, 2, 2,
       6, 4, 7, 2, 1, 3, 6, 5, 6, 6, 0, 6, 6, 6, 5, 7, 0, 6, 6, 3, 5, 0,
       5, 7, 2, 0, 6, 7, 7, 6, 6, 6, 2, 2, 6, 2, 7, 0, 4, 2, 6, 6, 5, 6,
       6, 2, 5, 4, 6, 4, 1, 4, 6, 0, 3, 2, 4, 0, 1, 4, 6, 6, 7, 5, 1, 6,
       0, 5, 6, 5, 0, 5, 2, 6, 5, 5, 6, 6, 5, 0, 5, 1, 6, 6, 2, 4, 0, 6,
       2, 6, 5, 2, 0, 6, 1, 6, 6, 6, 6, 5, 6, 6, 0, 7, 6, 6, 6, 6, 3, 5,
       1, 2, 6, 2, 2, 2, 3, 6, 7, 2, 6, 7, 3, 4, 0, 0, 6, 6, 2, 5, 6, 6,
       5, 4, 6, 5, 6, 7, 2, 2, 6, 0, 6, 3, 5, 5, 1, 5, 7, 7, 2, 0, 7, 5,
       5, 7, 6, 2, 3, 5, 6, 1, 5, 2, 2, 2, 5, 1, 7, 3, 6, 0, 2, 6, 5, 5,
       6, 0, 0, 6, 5, 2, 2, 6, 3, 6, 2, 1, 7, 0, 6, 2, 2, 1, 6, 3, 2, 2,
       2, 7, 6, 4, 6, 2, 6, 6, 5, 5, 6, 1, 6, 2, 2,

In [None]:
from sklearn.model_selection import GridSearchCV