# Patent Assignment Daily
Contains daily patent assignment text for 10/18/2016

In [1]:
import pandas as pd
import numpy as np

In [2]:
assignments = pd.read_csv('patent_assignment.csv', index_col= 0)

In [3]:
assignments.head()

Unnamed: 0,last-update-date,patent-assignees,patent-assignors,patent-countries,patent-dates,patent-kinds,patent-numbers,recorded-date,title
0,20161018,FASTCASE,"WALTERS, EDWARD J. III|ROSENTHAL, PHILIP J.",US|US,20001108|20161018,X0|B1,09707911|9471672,20010320,Relevance sorting for database searches
1,20161018,ANABASIS SRL,"LAMBIASE, ALESSANDRO",US|US,20010726|20161018,X0|B1,09890088|9468665,20010720,METHOD OF TREATING INTRAOCCULAR TISSUE PATHOLO...
2,20161018,QUALCOMM INCORPORATED,"WALTON, J. RODNEY|KETCHUM, JOHN W.",US|US|US,20031201|20050602|20161018,X0|A1|B2,10725904|20050120097|9473269,20031201,METHOD AND APPARATUS FOR PROVIDING AN EFFICIEN...
3,20161018,INTERNATIONAL BUSINESS MACHINES CORPORATION,"MORARIU, JANIS A.|STAPEL, STEVEN W.|STRAACH, J...",US|US|US,20040622|20051222|20161018,X0|A1|B2,10873346|20050282136|9472114,20040903,"COMPUTER-IMPLEMENTED METHOD, SYSTEM AND PROGRA..."
4,20161018,INTERNATIONAL BUSINESS MACHINES CORPORATION,"LI, XIN|ROBERTS, GREGORY WAYNE",US|US|US,20041019|20060420|20161018,X0|A1|B2,10967958|20060085754|9471332,20050208,Selecting graphical component types at runtime


In [4]:
titles_in_data=assignments['title']

In [5]:
unique_titles = list(set(titles_in_data))

In [6]:
data_size, vocab_size = len(titles_in_data), len(unique_titles)
print("Patent data has titles", data_size)
print("Patent data has unique titles", vocab_size)

Patent data has titles 7530
Patent data has unique titles 6441


## Topic Modeling

In [7]:
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

import nltk
import gensim

### tokenize

The first thing we have to do is **tokenize** our words. A naive way to do this would be to split our string based on spaces (e.g. str.split(" ")), which is sometimes OK but has many edge cases (alternative punctuation marks like —, for example) and will fail to work as expected for larger problems.

nltk comes with a built-in word tokenizer that we can take advantage of.
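
To see the kind of edge case that splitting on spaces misses, here is a minimal sketch on a made-up title. The regular expression is only a stand-in for a real tokenizer (an assumption for illustration, not how nltk.word_tokenize works internally), and the variable names are hypothetical.

```python
import re

# Hypothetical title, title-cased like the ones processed below.
title = "Computer-Implemented Method, System And Program Product"

# Naive approach: split on single spaces. The comma stays glued to "Method".
naive_tokens = title.split(" ")

# Stand-in tokenizer: runs of word characters/hyphens, or single punctuation marks.
regex_tokens = re.findall(r"[\w-]+|[^\w\s]", title)

print(naive_tokens)
print(regex_tokens)
```

With the naive split, "Method," and "Method" would be counted as two different words, which is exactly the kind of noise a proper tokenizer avoids.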

In [8]:
titles = assignments['title']
title_tokens = [nltk.word_tokenize(title) for title in\
                    np.concatenate(titles.map(str).map(str.title).map(lambda s: s.split("|")))]

In [9]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Hassan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [10]:
# Keep only titles that produced at least one token.
title_tokens = [title for title in title_tokens if len(title) > 0]

In [11]:
len(title_tokens)

370460

In [12]:
title_tokens[:3]

[['Relevance', 'Sorting', 'For', 'Database', 'Searches'],
 ['Method',
  'Of',
  'Treating',
  'Intraoccular',
  'Tissue',
  'Pathologies',
  'With',
  'Nerve',
  'Growth',
  'Factor',
  '.'],
 ['Method',
  'And',
  'Apparatus',
  'For',
  'Providing',
  'An',
  'Efficient',
  'Control',
  'Channel',
  'Structure',
  'In',
  'A',
  'Wireless',
  'Communication',
  'System']]

In [13]:
print(title_tokens[2])

['Method', 'And', 'Apparatus', 'For', 'Providing', 'An', 'Efficient', 'Control', 'Channel', 'Structure', 'In', 'A', 'Wireless', 'Communication', 'System']


### stem

Next, we will stem our words. Stemming is a procedure in natural language processing where we chop off word endings so that only the root remains. For example, the words connect, connected, and connecting all map to the same root: connect.

This is a good thing to do, particularly given the small size of our documents, because it collapses different forms of a word into a single token, so short titles end up sharing more terms.

nltk comes with several stemmers installed; we'll use the PorterStemmer.
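
As a rough illustration of the idea, here is a deliberately crude suffix stripper. The function `crude_stem` and its suffix list are hypothetical and far simpler than the Porter algorithm; the sketch only shows how several surface forms can collapse into one token.

```python
def crude_stem(word):
    """Toy stemmer: lowercase the word and strip one common suffix, if any."""
    word = word.lower()
    for suffix in ("ing", "ed", "ion", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["Connect", "Connected", "Connecting", "Connection", "Connects"]
stems = [crude_stem(w) for w in words]
print(stems)
```

All five forms end up as the same stem, so they would be counted as a single word in the steps that follow.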

In [14]:
stemmer = nltk.stem.PorterStemmer()
titles_stemmed = [[stemmer.stem(token) for token in tokens] for tokens in title_tokens]

In [15]:
titles_stemmed[:3]

[['relev', 'sort', 'for', 'databas', 'search'],
 ['method',
  'Of',
  'treat',
  'intraoccular',
  'tissu',
  'patholog',
  'with',
  'nerv',
  'growth',
  'factor',
  '.'],
 ['method',
  'and',
  'apparatu',
  'for',
  'provid',
  'An',
  'effici',
  'control',
  'channel',
  'structur',
  'In',
  'A',
  'wireless',
  'commun',
  'system']]

In [16]:
print(titles_stemmed[2])

['method', 'and', 'apparatu', 'for', 'provid', 'An', 'effici', 'control', 'channel', 'structur', 'In', 'A', 'wireless', 'commun', 'system']


If we examine a list of words, however, we see that the most common English-language words dominate:

In [17]:
print(pd.Series(np.concatenate(titles_stemmed)).value_counts())

and                         170856
for                         145224
method                      137910
A                           101633
Of                           91017
system                       75843
devic                        65015
with                         46433
In                           39461
,                            37235
apparatu                     31147
circuit                      29046
memori                       28675
semiconductor                26290
use                          24697
To                           23959
data                         22710
An                           22235
control                      20957
the                          20244
have                         17662
integr                       16994
process                      16233
structur                     16052
network                      14185
form                         14012
On                           12411
commun                       12408
power               

These words carry little meaning on their own and aren't very interesting. Punctuation shows up too: for example, "," occurs 37235 times.

They're known as **stopwords** in NLP, and we're going to once again use nltk builtins to remove them from consideration.

### stopwords

In [18]:
from nltk.corpus import stopwords

In [19]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Hassan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [20]:
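# Stopwords are title-cased here, so only title-cased tokens (e.g. 'Of', 'An', 'In', 'A') match;
# lowercase stems such as 'for' and 'and' pass through the filter below unchanged.
english_stopwords = set([word.title() for word in stopwords.words("english")])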

In [21]:
stemmed_title_words = [[word for word in title if word not in english_stopwords] for title in titles_stemmed]

In [22]:
stemmed_title_words[:3]

[['relev', 'sort', 'for', 'databas', 'search'],
 ['method',
  'treat',
  'intraoccular',
  'tissu',
  'patholog',
  'with',
  'nerv',
  'growth',
  'factor',
  '.'],
 ['method',
  'and',
  'apparatu',
  'for',
  'provid',
  'effici',
  'control',
  'channel',
  'structur',
  'wireless',
  'commun',
  'system']]

In [23]:
print(stemmed_title_words[2])

['method', 'and', 'apparatu', 'for', 'provid', 'effici', 'control', 'channel', 'structur', 'wireless', 'commun', 'system']
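
In the output above, lowercase stems such as 'for' and 'and' survive the filter, because the stopword set is title-cased while most stems are lowercase. A minimal sketch of a case-insensitive comparison follows, assuming a tiny hand-written stopword set (`toy_stopwords`, hypothetical, standing in for nltk's English list) and two sample titles copied from the stemmed output.

```python
# Tiny illustrative stopword set; NOT nltk's full English list.
toy_stopwords = {"a", "an", "and", "for", "in", "of", "the", "with"}

sample_titles = [
    ["relev", "sort", "for", "databas", "search"],
    ["method", "and", "apparatu", "for", "provid", "An", "effici", "control"],
]

# Lowercase each token before the membership test so that case never matters.
filtered_titles = [
    [word for word in title if word.lower() not in toy_stopwords]
    for title in sample_titles
]
print(filtered_titles)
```

Applying a filter like this to the full data would change the token counts and every downstream result shown in this notebook, so treat it as a variant to experiment with rather than a drop-in replacement.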


In [24]:
word_counts = pd.Series(np.concatenate(stemmed_title_words)).value_counts()
# Words that occur exactly once across all titles.
singular_words = set(word_counts[word_counts == 1].index)

In [25]:
stemmed_title_common_words = [[word for word in title if word not in singular_words] for title in stemmed_title_words]

In [26]:
stemmed_title_common_words[:3]

[['relev', 'sort', 'for', 'databas', 'search'],
 ['method',
  'treat',
  'intraoccular',
  'tissu',
  'patholog',
  'with',
  'nerv',
  'growth',
  'factor',
  '.'],
 ['method',
  'and',
  'apparatu',
  'for',
  'provid',
  'effici',
  'control',
  'channel',
  'structur',
  'wireless',
  'commun',
  'system']]

In [27]:
print(stemmed_title_common_words[2])

['method', 'and', 'apparatu', 'for', 'provid', 'effici', 'control', 'channel', 'structur', 'wireless', 'commun', 'system']


Next, let's consider the opposite problem: words that occur too infrequently to be useful. Words that only ever appear once, for example, don't carry any information. Remember, we're going to split all of our patent titles into some small number of classes; just as in any other dataset, a data point which is only populated once isn't interesting, and can be safely dropped.

In fact, we could probably drop a lot of words from consideration, not just ones appearing once but ones appearing tens or even hundreds of times. This would speed up our algorithms and won't significantly impact our results.

After a certain point words do start to matter, however; figuring out where that point is is up to you.

In our case we'll just be lazy and cut off at words that appear only once, and leave words appearing twice or more intact.
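
The cutoff can be thought of as a single tunable number. Here is a small sketch with `collections.Counter` on made-up stemmed titles; `min_count` and the toy data are assumptions for illustration, and raising `min_count` gives the more aggressive pruning discussed above.

```python
from collections import Counter

# Hypothetical stemmed titles.
toy_titles = [
    ["memori", "cell"],
    ["memori", "devic"],
    ["beverag", "capsul"],
    ["devic", "cell", "headphon"],
]

# Count every word across all titles.
counts = Counter(word for title in toy_titles for word in title)

min_count = 2  # keep words that appear at least this many times
pruned = [[w for w in title if counts[w] >= min_count] for title in toy_titles]

# Titles made up entirely of rare words become empty and must be dropped later.
non_empty = [i for i, title in enumerate(pruned) if title]
print(pruned)
print(non_empty)
```

Note how the third toy title becomes empty once its rare words are removed; such titles have to be filtered out before modeling.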

In [28]:
non_empty_indices = [i for i in range(len(stemmed_title_common_words)) if len(stemmed_title_common_words[i]) > 0]

In [29]:
non_empty_indices[5000]

5003

Notice that discarding words from our set has resulted in a handful of empty titles. Apparently a few patents have nothing but unique words!

In [30]:
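# dtype=object keeps each title as a Python list; newer NumPy versions reject ragged input without it.
stemmed_title_common_words_nonnull = np.asarray(stemmed_title_common_words, dtype=object)[non_empty_indices]

In [31]:
classifiable_titles = np.asarray(title_tokens, dtype=object)[non_empty_indices]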

In [32]:
classifiable_titles[:5]

array([list(['Relevance', 'Sorting', 'For', 'Database', 'Searches']),
       list(['Method', 'Of', 'Treating', 'Intraoccular', 'Tissue', 'Pathologies', 'With', 'Nerve', 'Growth', 'Factor', '.']),
       list(['Method', 'And', 'Apparatus', 'For', 'Providing', 'An', 'Efficient', 'Control', 'Channel', 'Structure', 'In', 'A', 'Wireless', 'Communication', 'System']),
       list(['Computer-Implemented', 'Method', ',', 'System', 'And', 'Program', 'Product', 'For', 'Providing', 'An', 'Educational', 'Program']),
       list(['Selecting', 'Graphical', 'Component', 'Types', 'At', 'Runtime'])],
      dtype=object)

With our titles adequately processed, now we switch over to gensim. The first thing we have to do is build a dictionary of words, which associates each word [stem] with a particular index number:

In [33]:
dictionary = gensim.corpora.Dictionary(stemmed_title_common_words_nonnull)

In [34]:
str(dictionary.token2id)[:1000]

"{'databas': 0, 'for': 1, 'relev': 2, 'search': 3, 'sort': 4, '.': 5, 'factor': 6, 'growth': 7, 'intraoccular': 8, 'method': 9, 'nerv': 10, 'patholog': 11, 'tissu': 12, 'treat': 13, 'with': 14, 'and': 15, 'apparatu': 16, 'channel': 17, 'commun': 18, 'control': 19, 'effici': 20, 'provid': 21, 'structur': 22, 'system': 23, 'wireless': 24, ',': 25, 'computer-impl': 26, 'educ': 27, 'product': 28, 'program': 29, 'compon': 30, 'graphic': 31, 'runtim': 32, 'select': 33, 'type': 34, 'aggreg': 35, 'chromatographi': 36, 'high': 37, 'hydroxyapatit': 38, 'molecular': 39, 'remov': 40, 'use': 41, 'weight': 42, 'aid': 43, 'differ': 44, 'further': 45, 'protocol': 46, 'station': 47, 'the': 48, 'transpond': 49, 'convert': 50, 'input': 51, 'languag': 52, 'output': 53, 'phonet': 54, 'written': 55, 'dataset': 56, 'from': 57, 'link': 58, 'methodolog': 59, 'multi-mod': 60, 'pattern': 61, 'charact': 62, 'digit': 63, 'media': 64, 'person': 65, 'replac': 66, 'airway': 67, 'detect': 68, 'instabl': 69, 'align': 7

Why are we doing this? Because shortly we're going to throw our corpus into a tf-idf algorithm. TF-IDF is a weighting scheme from information retrieval which converts each document's word counts into a scaled vector of unit Euclidean length. It turns a count of the number of each word in our document into a unit vector in N-dimensional space, where N is, believe it or not, the number of individual words that we have in our dictionary (above).

That means that, in this case, we have a "dataset" matrix with hundreds of thousands of columns in it!

The beauty of TF-IDF is that it scales the words according to how frequent or rare they are. Words that appear a lot in your text but also appear a lot in the rest of the corpus are weighted less heavily than words that appear a lot in your text but more rarely outside of it.
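
To make the weighting concrete, here is a from-scratch sketch on four made-up stemmed titles. It assumes one common form of the scheme, term count times log2(N / document frequency) followed by scaling to unit length; this is an illustration under that assumption, not gensim's code, and `tfidf_vector` is a hypothetical helper.

```python
import math
from collections import Counter

# Hypothetical stemmed titles.
toy_docs = [
    ["method", "for", "memori", "cell"],
    ["method", "for", "fiber", "cabl"],
    ["method", "system", "for", "memori"],
    ["system", "for", "optic", "fiber"],
]
n_docs = len(toy_docs)

# Document frequency: in how many titles does each word appear?
doc_freq = Counter(word for doc in toy_docs for word in set(doc))

def tfidf_vector(doc):
    """Return {word: weight} scaled to unit Euclidean length (assumed formula)."""
    counts = Counter(doc)
    raw = {w: c * math.log2(n_docs / doc_freq[w]) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    if norm == 0:
        return {}
    # Words present in every document get weight 0 and are dropped.
    return {w: v / norm for w, v in raw.items() if v > 0}

vec = tfidf_vector(toy_docs[0])
print(vec)
```

In the first toy title, "for" appears in every document and drops out entirely, while the rarest word, "cell", receives the largest weight.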

Thus we first use gensim to convert our words to word incidence vectors...

In [35]:
corpus = [dictionary.doc2bow(text) for text in stemmed_title_common_words_nonnull]

In [36]:
stemmed_title_common_words_nonnull[0], corpus[0]

(['relev', 'sort', 'for', 'databas', 'search'],
 [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)])

In [37]:
print(stemmed_title_common_words_nonnull[2], corpus[2])

['method', 'and', 'apparatu', 'for', 'provid', 'effici', 'control', 'channel', 'structur', 'wireless', 'commun', 'system'] [(1, 1), (9, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1)]


In [38]:
print(stemmed_title_common_words_nonnull[100], corpus[100])

['passeng', 'transport', 'system', 'and', 'method', 'for', 'obtain', 'ticket', 'such', 'system'] [(1, 1), (9, 1), (15, 1), (23, 2), (300, 1), (325, 1), (326, 1), (327, 1), (328, 1)]
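
The pairs printed above are (token id, count) tuples, which is why the repeated word 'system' shows a count of 2. A pure-Python sketch of the same idea on two of the titles follows; `token2id` and `to_bow` are hypothetical stand-ins, and because only two titles are used the ids differ from the real dictionary's. This is an illustration, not gensim's implementation.

```python
from collections import Counter

toy_docs = [
    ["relev", "sort", "for", "databas", "search"],
    ["passeng", "transport", "system", "and", "method", "for",
     "obtain", "ticket", "such", "system"],
]

# Build a word -> integer id mapping, assigning new ids in sorted order per document.
token2id = {}
for doc in toy_docs:
    for token in sorted(set(doc)):
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Convert a token list into a sorted list of (token id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in toy_docs]
for bow in bows:
    print(bow)
```

Only ids that actually occur in a title are stored, which keeps each vector short even when the dictionary is very large.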


...then run TfidfModel from gensim on them to turn them into our word vectors!

## Tfidf Model

In [39]:
from gensim.models import TfidfModel

In [40]:
tfidf = TfidfModel(corpus)

Note that gensim doesn't follow the scikit-learn access pattern, if you are familiar with it. It instead (1) defers computations on individual entries until necessary and (2) provides access to data using bracket indexing notation ([]).

By contrast, scikit-learn separates model initialization from computation (nothing is computed until you fit() your model), runs the whole computation eagerly at that point, and exposes results through attributes whose names end in an underscore (such as labels_).
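
The bracket-indexing, compute-on-demand style can be mimicked in a few lines of plain Python. The class below, `LazyDoubler`, is entirely hypothetical and has nothing to do with gensim's internals; it only sketches the access pattern of a transformation that is applied through [] and evaluated lazily for a whole corpus.

```python
class LazyDoubler:
    """Hypothetical transformation that doubles every count in a bag-of-words vector."""

    def __getitem__(self, item):
        # A single document is a list of (token id, count) tuples: transform it now.
        if item and isinstance(item[0], tuple):
            return [(token_id, 2 * count) for token_id, count in item]
        # Otherwise treat the argument as a corpus: return a generator, so no
        # document is transformed until someone iterates over the result.
        return (self[doc] for doc in item)

toy_corpus = [[(0, 1), (1, 1)], [(1, 2), (2, 1)]]
model = LazyDoubler()

single = model[toy_corpus[0]]   # computed immediately
lazy = model[toy_corpus]        # nothing computed yet
materialized = list(lazy)       # computation happens here
print(single)
print(materialized)
```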

In [41]:
stemmed_title_common_words_nonnull[0], corpus[0], tfidf[corpus[0]]

(['relev', 'sort', 'for', 'databas', 'search'],
 [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)],
 [(0, 0.4377475295381456),
  (1, 0.07562061510033051),
  (2, 0.5469009211535774),
  (3, 0.36992777626242423),
  (4, 0.6055670447985132)])

In [42]:
print(stemmed_title_common_words_nonnull[2], corpus[2], tfidf[corpus[2]])

['method', 'and', 'apparatu', 'for', 'provid', 'effici', 'control', 'channel', 'structur', 'wireless', 'commun', 'system'] [(1, 1), (9, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1)] [(1, 0.08840592502566112), (9, 0.08992064121432716), (15, 0.07858769583335751), (16, 0.22360915542920912), (17, 0.42309521007730033), (18, 0.3118124008719702), (19, 0.26099879446104235), (20, 0.44892517091974105), (21, 0.3559213131631317), (22, 0.2847423513065349), (23, 0.14769708350864824), (24, 0.3904590429022807)]


With our words suitably vectorized, we can now move on to fitting a model. Since our words are now, effectively, a very large numeric dataset, it's possible to use any general-purpose clustering algorithm on it. For example, we could use a k-means clustering algorithm (such as KMeans in scikit-learn) to arrive at its topics.

We'll instead use a model specifically adapted to natural language processing from the gensim built-ins, **LsiModel**.

## LSI Model

In [43]:
from gensim.models import LsiModel

Here's how we run it:

In [44]:
corpus_tfidf = tfidf[corpus]
lsi = LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)
corpus_lsi = lsi[corpus_tfidf]
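
Conceptually, LSI is a truncated singular value decomposition of the term-document matrix. The NumPy sketch below runs that decomposition on a tiny made-up matrix; the numbers, the choice of `k = 2`, and the helper `cosine` are assumptions for illustration, and gensim's own implementation is designed for large sparse corpora and works differently under the hood.

```python
import numpy as np

# Hypothetical weights: rows are terms, columns are documents.
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.8, 0.0, 0.1],
    [0.0, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of topics to keep
# Project every document onto the top-k left singular vectors ("topics").
doc_scores = (U[:, :k].T @ X).T   # shape: (n_documents, k)

def cosine(a, b):
    """Cosine similarity between two score vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(doc_scores)
print(cosine(doc_scores[0], doc_scores[1]), cosine(doc_scores[0], doc_scores[2]))
```

Documents that share terms (the first two columns) end up with nearly parallel score vectors, while documents with different vocabulary point in different directions. The signs of individual scores are arbitrary, which is why negative weights appear in the topic printout.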

Here's a printout of what words are important to our various topics. Notice that certain extremely common words, like semiconductor, appear in different positions in multiple topics. Also, note that this display is cut off at a certain number of displayed words; in reality the model considers far more than these (you can specify how many to display, however, using the num_words parameter).

In [45]:
lsi.print_topics(10)

[(0,
  '0.381*"devic" + 0.350*"semiconductor" + 0.298*"method" + 0.281*"and" + 0.229*"for" + 0.227*"system" + 0.194*"," + 0.188*"memori" + 0.183*"circuit" + 0.153*"apparatu"'),
 (1,
  '-0.603*"semiconductor" + -0.354*"devic" + 0.328*"system" + 0.197*"apparatu" + 0.171*"for" + 0.166*"data" + 0.153*"and" + 0.152*"," + -0.146*"manufactur" + -0.132*"form"'),
 (2,
  '-0.718*"circuit" + -0.502*"integr" + 0.162*"system" + 0.122*"data" + 0.114*"semiconductor" + 0.111*"devic" + -0.111*"packag" + 0.101*"commun" + -0.100*"voltag" + 0.096*"network"'),
 (3,
  '0.666*"memori" + -0.347*"commun" + 0.281*"cell" + -0.202*"handl" + -0.175*"semiconductor" + -0.170*"inform" + 0.140*"non-volatil" + -0.125*"wireless" + -0.119*"devic" + 0.114*"form"'),
 (4,
  '-0.560*"," + -0.305*"imag" + -0.263*"apparatu" + 0.235*"system" + 0.231*"handl" + 0.213*"manag" + -0.189*"display" + 0.182*"power" + 0.160*"network" + 0.152*"inform"'),
 (5,
  '-0.575*"fiber" + -0.542*"optic" + -0.275*"cabl" + -0.274*"connector" + 0.175

Here are the scoring outputs for the first five documents:

In [46]:
for scores in corpus_lsi[:5]:
    print(scores)

[(0, 0.025523921497501233), (1, 0.02742137905166831), (2, 0.013944757678545172), (3, 0.0012995427812823302), (4, 0.009984642561234753), (5, 0.0034816726247226927), (6, 0.007081590779086658), (7, 0.0010218310607902725), (8, 0.001715545959477425), (9, -0.00399700710903286)]
[(0, 0.02634505034794532), (1, 0.007172129908303868), (2, -0.0016285344172783296), (3, 0.004574968503089758), (4, 0.0037087043066056456), (5, -0.010196541603278944), (6, -0.0021291613019748883), (7, -0.006346875322403437), (8, 0.0006334734870246986), (9, -0.01634178532986781)]
[(0, 0.2644460156335694), (1, 0.2266817130900064), (2, 0.11107966755741627), (3, -0.1599181763458694), (4, 0.02290172860963049), (5, 0.029976275586513626), (6, 0.20702617213579297), (7, -0.028177523482901024), (8, 0.013766912548891374), (9, -0.07018245636598941)]
[(0, 0.13516972344301473), (1, 0.11687154416760245), (2, 0.030596894262866832), (3, 0.04734394818514877), (4, -0.11523086229584573), (5, 0.046806496255013), (6, -0.044329445044070506), 

Let's use these scores to fetch best-fit classifications for all of our (classifiable) patents:

In [47]:
classifications = [np.argmax(np.asarray(corpus_lsi[i])[:,1]) for i in range(len(stemmed_title_common_words_nonnull))]
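
The selection rule is simply "take the topic with the largest score". Here is a small sketch on the first document's scores, rounded from the output above; picking the pair directly with `max` is an alternative formulation (an assumption, not the notebook's method) that returns the topic id itself rather than a position in the list.

```python
# (topic id, score) pairs for the first document, rounded from the output above.
doc_scores = [
    (0, 0.0255), (1, 0.0274), (2, 0.0139), (3, 0.0013), (4, 0.0100),
    (5, 0.0035), (6, 0.0071), (7, 0.0010), (8, 0.0017), (9, -0.0040),
]

# Choose the pair with the highest score, then keep its topic id.
best_topic, best_score = max(doc_scores, key=lambda pair: pair[1])
print(best_topic, best_score)
```

Because scores can be negative, this picks the largest signed value; ranking by absolute value instead would be a different design choice.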

In [48]:
topics = pd.DataFrame({'topic': classifications, 'title': classifiable_titles})

Certain topics that our model arrives at are much more common than others.

In [49]:
topics['topic'].value_counts()

0    260315
1     42149
3     24722
4     14038
9     10467
8      8978
6      8650
7       435
2       360
5       308
Name: topic, dtype: int64

## Visuals

Let's see what our classes look like.

In [50]:
from IPython.display import display

In [51]:
# Printed labels are 1-based, while the 'topic' column is 0-based ("Topic 1" shows topic 0).
for i in range(10):
    print("Topic", i + 1)
    display(topics.query('topic == @i').head(5))

Topic 1


Unnamed: 0,topic,title
1,0,"[Method, Of, Treating, Intraoccular, Tissue, P..."
2,0,"[Method, And, Apparatus, For, Providing, An, E..."
3,0,"[Computer-Implemented, Method, ,, System, And,..."
4,0,"[Selecting, Graphical, Component, Types, At, R..."
5,0,"[Removal, Of, High, Molecular, Weight, Aggrega..."


Topic 2


Unnamed: 0,topic,title
0,1,"[Relevance, Sorting, For, Database, Searches]"
19,1,"[Implicit, Searching, For, Mobile, Content]"
30,1,"[Physical, Navigation, Of, A, Mobile, Search, ..."
53,1,"[Call, Control, Server]"
54,1,"[Methods, And, Apparatus, For, Communicating, ..."


Topic 3


Unnamed: 0,topic,title
3541,2,"[Beverage, Capsule]"
8962,2,[Headphones]
10974,2,"[Drag-Type, Casing, Mill/Drill, Bit]"
11254,2,"[Drag-Type, Casing, Mill/Drill, Bit]"
14937,2,"[Beverage, Capsule]"


Topic 4


Unnamed: 0,topic,title
178,3,"[Driver, For, Non-Linear, Displays, Comprising..."
198,3,"[Erasable, And, Programmable, Non-Volatile, Cell]"
232,3,"[Method, And, System, For, Accelerated, Access..."
256,3,"[Two-Dimensional, Data, Memory]"
322,3,"[Error, Correction, Scheme, For, Use, In, Flas..."


Topic 5


Unnamed: 0,topic,title
40,4,"[Resource, Consumption, Reduction, Via, Meetin..."
74,4,"[Power, Converter]"
157,4,"[Antenna, Configuration]"
291,4,"[Data, Carrier, For, Storing, Information, Rep..."
320,4,"[Digital, Rights, Management, Unit, For, A, Di..."


Topic 6


Unnamed: 0,topic,title
285,5,"[Method, Of, Calling, Up, Object-Specific, Inf..."
2050,5,"[Collecting, Information, Before, A, Call]"
2557,5,"[Collecting, Information, Before, A, Call]"
3357,5,"[Collecting, Information, Before, A, Call]"
3963,5,"[Telephony, Usage, Derived, Presence, Informat..."


Topic 7


Unnamed: 0,topic,title
6,6,"[Communication, Station, For, Communication, W..."
18,6,"[Mobile, Search, Substring, Query, Completion]"
20,6,"[Creation, Of, A, Mobile, Search, Suggestion, ..."
21,6,"[Mobile, Pay-Per-Call, Campaign, Creation]"
22,6,"[Mobile, Pay-Per-Call, Campaign, Creation]"


Topic 8


Unnamed: 0,topic,title
1465,7,"[Handling, Complex, Regex, Patterns, Storage-E..."
4077,7,"[Generic, Information, Element]"
4809,7,"[Buddy, Lists, For, Information, Vehicles]"
5255,7,"[Paper, Sheet, Handling, Apparatus]"
9900,7,"[Setting, User-Preference, Information, On, Th..."


Topic 9


Unnamed: 0,topic,title
580,8,"[Trench, Mos, Structure]"
618,8,"[Connector, For, Chip-Card]"
717,8,"[Multitrack, Optical, Disc, Reader]"
909,8,"[Planarising, Damascene, Structures]"
1213,8,"[Coil, Construction]"


Topic 10


Unnamed: 0,topic,title
26,9,"[Mobile, Search, Result, Clustering]"
27,9,"[Mobile, Search, Service, Discovery]"
41,9,"[Methods, ,, Systems, ,, And, Computer, Progra..."
88,9,"[Wireless, Terminal, ,, Wireless, Module, And,..."
133,9,"[Method, Of, Compiling, A, Source, Code, Progr..."
