
# Goal -- To implement Bag-Of-Words!

In [1]:
import pandas as pd
import re
url = 'https://drive.google.com/file/d/18DdQd4ZPbcvOeZ6x2KRJHmvpGarnw9Qx/view?usp=sharing'
file_id = url.split('/')[-2]
dwn_url='https://drive.google.com/uc?id=' + file_id
dataframe = pd.read_csv(dwn_url)

We now have a data set of 10 patents. Three of them are similar to each other: they all cover liquid laundry detergents.

In [2]:
dataframe.describe()

Unnamed: 0,Publication_Number,Abstract,Description,Claim
count,10,10,10,10
unique,10,10,10,10
top,US-6843642-B2,A new class of avatars (“organizational avatar...,BACKGROUND OF THE INVENTION \n (a) Field of ...,"1. A method of communicating between users, th..."
freq,1,1,1,1


In [3]:
dataframe

Unnamed: 0,Publication_Number,Abstract,Description,Claim
0,US-7365259-B2,A keyboard apparatus is including a plurality ...,BACKGROUND OF THE INVENTION \n 1. Field of t...,1. A keyboard apparatus comprising:\n a suppor...
1,US-7556524-B2,"An easy-pull type swivel plug includes a body,...",BACKGROUND OF THE INVENTION \n 1. Field of t...,"1. An easy-pull type swivel plug, comprising:\..."
2,US-7338315-B2,The invention relates to a closure device comp...,FIELD OF THE INVENTION \n The invention rela...,1. A closure device comprising:\n a wall havin...
3,US-6843642-B2,An air compressor with shock-absorption rubber...,BACKGROUND OF THE INVENTION \n (a) Field of ...,1. An air compressor with shock-absorption rub...
4,US-9433212-B2,Provided is a novel plant growth regulator. Th...,TECHNICAL FIELD \n The present invention r...,The invention claimed is: \n \n 1. A...
5,US-5536436-A,A liquid laundry detergent composition contain...,FIELD OF THE INVENTION \n The present inve...,What is claimed is: \n \n 1. A heavy...
6,US-2015111807-A1,A liquid laundry detergent composition compris...,FIELD OF THE INVENTION \n The present inve...,What is claimed is: \n \n 1 . A li...
7,US-7605322-B2,As a player inputs a performance of a music pi...,TECHNICAL FIELD \n The present invention rel...,1. An apparatus for automatically starting an ...
8,US-7205268-B2,A low-foaming aqueous liquid laundry detergent...,FIELD OF THE INVENTION \n The present inve...,1. A low-foaming aqueous liquid laundry deterg...
9,US-6910186-B2,A new class of avatars (“organizational avatar...,CROSS-REFERENCE TO APPENDICES ATTACHED HERETO ...,"1. A method of communicating between users, th..."


The function below tokenizes a string according to the following rules. For each row:

1. Split on whitespace, punctuation, apostrophes, etc.
2. Set everything to lowercase.
3. Replace numbers with a single placeholder token (e.g., `_NUM_`), so that 1, 100, and 1000 all map to the same token.
4. Remove all one-letter "words".


In [4]:
def tokenize(string):
    # Lowercase first so that tokens compare case-insensitively
    lowercasedString = string.lower()
    # Split on runs of non-word characters (whitespace, punctuation, apostrophes)
    stringArray = re.split(r'\W+', lowercasedString)
    # Replace every run of digits with the placeholder _NUM_
    stringArray = [re.sub(r"[0-9]+", "_NUM_", s) for s in stringArray]
    # Drop one-letter words like "i" and "a" (and any empty strings left by the split)
    stringArray = [s for s in stringArray if len(s) > 1]
    # Return the tokens as a list
    return stringArray
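
To make the four rules concrete, here is a small trace of the same steps on a made-up sentence. The sentence and the intermediate variable names (`sample`, `lowered`, `pieces`, `numbered`, `tokens`) are assumptions for illustration only and are not part of the patent data.

```python
import re

# Made-up sentence for illustration; not taken from the patent data
sample = "A detergent's pH is 7 at 100 C with C12 chains."

# Step 2: lowercase everything
lowered = sample.lower()

# Step 1: split on runs of non-word characters (spaces, apostrophes, periods)
pieces = re.split(r"\W+", lowered)

# Step 3: replace each run of digits with the placeholder _NUM_
numbered = [re.sub(r"[0-9]+", "_NUM_", p) for p in pieces]

# Step 4: drop one-letter tokens (this also drops the empty string left by the final ".")
tokens = [p for p in numbered if len(p) > 1]

print(tokens)
# ['detergent', 'ph', 'is', '_NUM_', 'at', '_NUM_', 'with', 'c_NUM_', 'chains']
```

Note that a token mixing letters and digits, such as `c12`, becomes `c_NUM_` rather than `_NUM_`, because only the digit run is substituted.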

Next, we tokenize the abstract of each patent and store the result in a new dataframe column.<br/>
This is simple with pandas!

In [5]:
dataframe['Tokenized_Abstract'] = dataframe['Abstract'].apply(tokenize)

In [6]:
dataframe

Unnamed: 0,Publication_Number,Abstract,Description,Claim,Tokenized_Abstract
0,US-7365259-B2,A keyboard apparatus is including a plurality ...,BACKGROUND OF THE INVENTION \n 1. Field of t...,1. A keyboard apparatus comprising:\n a suppor...,"[keyboard, apparatus, is, including, plurality..."
1,US-7556524-B2,"An easy-pull type swivel plug includes a body,...",BACKGROUND OF THE INVENTION \n 1. Field of t...,"1. An easy-pull type swivel plug, comprising:\...","[an, easy, pull, type, swivel, plug, includes,..."
2,US-7338315-B2,The invention relates to a closure device comp...,FIELD OF THE INVENTION \n The invention rela...,1. A closure device comprising:\n a wall havin...,"[the, invention, relates, to, closure, device,..."
3,US-6843642-B2,An air compressor with shock-absorption rubber...,BACKGROUND OF THE INVENTION \n (a) Field of ...,1. An air compressor with shock-absorption rub...,"[an, air, compressor, with, shock, absorption,..."
4,US-9433212-B2,Provided is a novel plant growth regulator. Th...,TECHNICAL FIELD \n The present invention r...,The invention claimed is: \n \n 1. A...,"[provided, is, novel, plant, growth, regulator..."
5,US-5536436-A,A liquid laundry detergent composition contain...,FIELD OF THE INVENTION \n The present inve...,What is claimed is: \n \n 1. A heavy...,"[liquid, laundry, detergent, composition, cont..."
6,US-2015111807-A1,A liquid laundry detergent composition compris...,FIELD OF THE INVENTION \n The present inve...,What is claimed is: \n \n 1 . A li...,"[liquid, laundry, detergent, composition, comp..."
7,US-7605322-B2,As a player inputs a performance of a music pi...,TECHNICAL FIELD \n The present invention rel...,1. An apparatus for automatically starting an ...,"[as, player, inputs, performance, of, music, p..."
8,US-7205268-B2,A low-foaming aqueous liquid laundry detergent...,FIELD OF THE INVENTION \n The present inve...,1. A low-foaming aqueous liquid laundry deterg...,"[low, foaming, aqueous, liquid, laundry, deter..."
9,US-6910186-B2,A new class of avatars (“organizational avatar...,CROSS-REFERENCE TO APPENDICES ATTACHED HERETO ...,"1. A method of communicating between users, th...","[new, class, of, avatars, organizational, avat..."


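## NEW: Function for Bag-of-Words (BoW)

The function below counts how many times each token appears in a tokenized abstract and returns the counts as a dictionary.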

In [8]:
def bagOfWordize(tokenized_abstract):
    # Map each word to the number of times it occurs in the abstract
    wordFrequency = {}
    for word in tokenized_abstract:
        # get() returns 0 for a word that has not been seen yet
        wordFrequency[word] = wordFrequency.get(word, 0) + 1
    return wordFrequency
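
As a cross-check, the same word counts can be produced with `collections.Counter` from the Python standard library. This is an alternative sketch, not the notebook's method, and the token list below is made up for illustration.

```python
from collections import Counter

# Made-up token list for illustration; not taken from the patent data
tokens = ["liquid", "laundry", "detergent", "liquid", "_NUM_", "detergent"]

# Counter tallies occurrences; dict() converts the result to a plain dictionary
bag = dict(Counter(tokens))

print(bag)
# {'liquid': 2, 'laundry': 1, 'detergent': 2, '_NUM_': 1}
```

The hand-written loop is kept in the notebook because it makes the counting logic explicit; `Counter` is a shorter option if that is not needed.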

In [9]:
dataframe['Bag_Of_Words_Abstract'] = dataframe['Tokenized_Abstract'].apply(bagOfWordize)

In [10]:
dataframe

Unnamed: 0,Publication_Number,Abstract,Description,Claim,Tokenized_Abstract,Bag_Of_Words_Abstract
0,US-7365259-B2,A keyboard apparatus is including a plurality ...,BACKGROUND OF THE INVENTION \n 1. Field of t...,1. A keyboard apparatus comprising:\n a suppor...,"[keyboard, apparatus, is, including, plurality...","{'keyboard': 3, 'apparatus': 1, 'is': 3, 'incl..."
1,US-7556524-B2,"An easy-pull type swivel plug includes a body,...",BACKGROUND OF THE INVENTION \n 1. Field of t...,"1. An easy-pull type swivel plug, comprising:\...","[an, easy, pull, type, swivel, plug, includes,...","{'an': 1, 'easy': 1, 'pull': 1, 'type': 1, 'sw..."
2,US-7338315-B2,The invention relates to a closure device comp...,FIELD OF THE INVENTION \n The invention rela...,1. A closure device comprising:\n a wall havin...,"[the, invention, relates, to, closure, device,...","{'the': 16, 'invention': 2, 'relates': 1, 'to'..."
3,US-6843642-B2,An air compressor with shock-absorption rubber...,BACKGROUND OF THE INVENTION \n (a) Field of ...,1. An air compressor with shock-absorption rub...,"[an, air, compressor, with, shock, absorption,...","{'an': 2, 'air': 7, 'compressor': 7, 'with': 2..."
4,US-9433212-B2,Provided is a novel plant growth regulator. Th...,TECHNICAL FIELD \n The present invention r...,The invention claimed is: \n \n 1. A...,"[provided, is, novel, plant, growth, regulator...","{'provided': 1, 'is': 1, 'novel': 1, 'plant': ..."
5,US-5536436-A,A liquid laundry detergent composition contain...,FIELD OF THE INVENTION \n The present inve...,What is claimed is: \n \n 1. A heavy...,"[liquid, laundry, detergent, composition, cont...","{'liquid': 2, 'laundry': 1, 'detergent': 2, 'c..."
6,US-2015111807-A1,A liquid laundry detergent composition compris...,FIELD OF THE INVENTION \n The present inve...,What is claimed is: \n \n 1 . A li...,"[liquid, laundry, detergent, composition, comp...","{'liquid': 1, 'laundry': 1, 'detergent': 1, 'c..."
7,US-7605322-B2,As a player inputs a performance of a music pi...,TECHNICAL FIELD \n The present invention rel...,1. An apparatus for automatically starting an ...,"[as, player, inputs, performance, of, music, p...","{'as': 2, 'player': 2, 'inputs': 1, 'performan..."
8,US-7205268-B2,A low-foaming aqueous liquid laundry detergent...,FIELD OF THE INVENTION \n The present inve...,1. A low-foaming aqueous liquid laundry deterg...,"[low, foaming, aqueous, liquid, laundry, deter...","{'low': 2, 'foaming': 1, 'aqueous': 1, 'liquid..."
9,US-6910186-B2,A new class of avatars (“organizational avatar...,CROSS-REFERENCE TO APPENDICES ATTACHED HERETO ...,"1. A method of communicating between users, th...","[new, class, of, avatars, organizational, avat...","{'new': 1, 'class': 1, 'of': 7, 'avatars': 9, ..."
