# 1. Office actions
After receiving a patent application, the patent examiner issues an office action on it.

https://bulkdata.uspto.gov/data/patent/office/actions/bigdata/2017/

1. Create the simplest problem formulation possible on a development set
2. Train the simple model on the whole data set
3. Eventually, attempt to predict the complete IPC hierarchy

In [2]:
import pandas as pd
import numpy as np
%matplotlib inline

In [None]:
#!wget http://www.patentsview.org/data/20171226/claim.tsv.zip
!wget http://www.patentsview.org/data/20171226/ipcr.tsv.zip
#!wget http://www.patentsview.org/data/20171226/brf_sum_text.tsv.zip

In [None]:
!unzip -o ipcr.tsv.zip
#!unzip -o claim.tsv.zip
#!unzip -o brf_sum_text.tsv.zip

In [None]:
#adc = pd.read_csv('claim.tsv', nrows=2000, delimiter='\t')

In [3]:
ipcr = pd.read_csv('ipcr.tsv', delimiter='\t', low_memory=False,
                  parse_dates=['action_date', 'ipc_version_indicator'])
ipcr.section.unique()




array(['D', 'H', 'G', 'F', 'A', 'C', 'B', 'E', 'Q', 'M', 'K', 'I', 'N',
       'R', 'O', '6', 'L', 'P', '0', 'S', 'Z', 'J', '2', 'X', 'W', '3',
       'V', '1', '8', 'h', 'g', '4', 'T', '?', 'U', 'c', 'Y', 'b', '9',
       '5', 'e'], dtype=object)

In [25]:
sec_desc_dict = {
'section': ['A','B','C','D','E','F','G','H'],
'sec_desc':  ['HUMAN NECESSITIES',
              'PERFORMING OPERATIONS; TRANSPORTING',
              'CHEMISTRY; METALLURGY','TEXTILES; PAPER','FIXED CONSTRUCTIONS',
              'MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING',
              'PHYSICS','ELECTRICITY']}
sec_desc = pd.DataFrame.from_dict(sec_desc_dict).set_index('section')

According to the IPC document, section should be a letter from A-H. 
http://www.wipo.int/classifications/ipc/ipcpub/?notion=scheme&version=20180101&symbol=none&menulang=en&lang=en&viewmode=f&fipcpc=no&showdeleted=yes&indexes=no&headings=yes&notes=yes&direction=o2n&initial=A&cwid=none&tree=no&searchmode=smart

In [4]:
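# Flag rows whose section is one of the valid IPC letters A-H
is_AH = ipcr.section.str.match('[A-H]')
not_AH_patents = ipcr.loc[~is_AH]
# Print the non-compliant share; a bare expression mid-cell is not displayed
print(not_AH_patents.shape[0] / ipcr.shape[0])
ipcr = ipcr.loc[is_AH]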

As this is just 0.01% of the data, we simply discard this non-compliant part.
The next important question is how many patents have multiple labels (i.e. more than one section letter assigned to them).

In [30]:
ipcr_s = ipcr.loc[:, ['patent_id', 'section']].drop_duplicates()
multi_s = ipcr_s.groupby('patent_id').count()
labelstats = multi_s.reset_index()
labelstats = labelstats.groupby('section').count()
labelstats = (labelstats/ipcr_s.shape[0])
labelstats.columns = ['% of Patents']
labelstats.index.name='Number of unique sections'
labelstats.style.format({'% of Patents': '{:,.2%}'.format})

| Number of unique sections | % of Patents |
|---|---|
| 1 | 75.02% |
| 2 | 10.81% |
| 3 | 1.00% |
| 4 | 0.08% |
| 5 | 0.01% |
| 6 | 0.00% |
| 7 | 0.00% |


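As seen above, 75% of the unique (patent, section) rows belong to patents with only one section. Note that the denominator `ipcr_s.shape[0]` counts rows rather than patents, so the column sums to about 87% instead of 100%; relative to the number of patents, the single-section share is roughly 75.02 / 86.92 ≈ 86%.
As a first step we will therefore NOT treat this as a multi-label problem, but as a simpler single-label multiclass problem.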
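The share of single-section patents can also be computed per patent rather than per row. The following is a minimal sketch on a made-up toy frame (the ids and sections are illustrative assumptions, not the real data). It counts the distinct sections per patent with `nunique` and normalizes by the number of patents, so the shares sum to 100%:

```python
import pandas as pd

# Toy stand-in for ipcr_s: one row per unique (patent_id, section) pair.
# The ids and sections below are made up for illustration.
toy = pd.DataFrame({
    'patent_id': ['p1', 'p2', 'p2', 'p3', 'p4', 'p4', 'p4'],
    'section':   ['A',  'G',  'H',  'B',  'C',  'A',  'G'],
})

# Number of distinct sections assigned to each patent.
sections_per_patent = toy.groupby('patent_id')['section'].nunique()

# Share of patents (not rows) for each section count.
share_of_patents = (sections_per_patent
                    .value_counts(normalize=True)
                    .sort_index())
share_of_patents.index.name = 'Number of unique sections'
print(share_of_patents)

# Ids of the patents that carry exactly one section.
single_label_ids = sections_per_patent[sections_per_patent == 1].index
print(list(single_label_ids))
```

Normalizing by patents rather than rows changes only the denominator; the set of single-section patent ids is the same either way.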

We will hence proceed to

1. Discard patents that have more than one section assigned in the classification

2. Save the remaining patent ids with their sections in a separate file

3. Create a new notebook that identifies the appropriate training data for these patents

In [39]:
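# Keep only single-section patents; index=False avoids an 'Unnamed: 0' column on reload
single_ids = multi_s.query('section == 1').index
ipcr_s.loc[ipcr_s.patent_id.isin(single_ids)].to_csv(
    'ipcr_single_label_sections_only.csv', index=False)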

In [40]:
ipcr = pd.read_csv('ipcr_single_label_sections_only.csv')
ipcr.info()

As seen above, we still have 5 million patents to classify.
Last but not least, let's check for any larger skews in the data:

In [31]:
sec_tab = pd.crosstab(index=ipcr_s['section'], columns='% of patents')
sec_tab = (100*sec_tab/sec_tab.sum()).sort_values('% of patents', 
                                                  ascending=False)
sec_tab.join(sec_desc)


| section | % of patents | sec_desc |
|---|---|---|
| G | 24.743841 | PHYSICS |
| H | 21.590743 | ELECTRICITY |
| B | 15.955769 | PERFORMING OPERATIONS; TRANSPORTING |
| A | 13.410776 | HUMAN NECESSITIES |
| C | 11.050994 | CHEMISTRY; METALLURGY |
| F | 7.556453 | MECHANICAL ENGINEERING; LIGHTING; HEATING; WEA... |
| D | 3.15105 | TEXTILES; PAPER |
| E | 2.540373 | FIXED CONSTRUCTIONS |


The category with the lowest percentage is 'FIXED CONSTRUCTIONS' with 2.5%. This is still more than sufficient, since 5 million x 2.5% = 125,000 patents.
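The same back-of-the-envelope estimate can be repeated for every section. The sketch below copies the percentages from the table above and assumes roughly 5 million single-section patents. Because those shares were computed on `ipcr_s` (before multi-section patents were removed), the resulting counts are only approximate:

```python
# Section shares in percent, copied from the crosstab output above.
section_share_pct = {
    'G': 24.743841, 'H': 21.590743, 'B': 15.955769, 'A': 13.410776,
    'C': 11.050994, 'F': 7.556453, 'D': 3.15105, 'E': 2.540373,
}

# Assumption: roughly 5 million single-section patents remain.
n_patents = 5_000_000

# Approximate number of patents per section.
expected_counts = {sec: round(n_patents * pct / 100)
                   for sec, pct in section_share_pct.items()}

# Section with the fewest expected patents.
smallest = min(expected_counts, key=expected_counts.get)

for sec, count in expected_counts.items():
    print(f'{sec}: ~{count:,} patents')
print('smallest section:', smallest)
```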

# Summary about IPC
1. As of 2018/06/18, the IPC data file has a little less than 6 million patents from 2006/01/24 - 2017/12/26.

2. A little less than 60% use IPC version 2006/01/01; 35% use an earlier version, recorded as '0000/00/00'.

3. After dropping NAs, we stand at 8 million IPC entries. After dropping duplicates, we find that classification_level appears to be mutually exclusive.
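
The checks behind points 2 and 3 could be sketched as follows. This is not the original analysis code: the toy frame and its values are made up, and "mutually exclusive" is read here as "no patent carries more than one `classification_level`", which is an assumption about what the summary means.

```python
import pandas as pd

# Toy stand-in for the IPC table; all values are made up for illustration.
toy = pd.DataFrame({
    'patent_id': ['p1', 'p1', 'p2', 'p3', 'p3', 'p4'],
    'classification_level': ['A', 'A', 'C', 'A', 'A', None],
    'ipc_version_indicator': ['2006-01-01', '2006-01-01', '0000-00-00',
                              '2006-01-01', '2006-01-01', '2006-01-01'],
})

# Drop incomplete rows, then exact duplicates.
clean = toy.dropna().drop_duplicates()

# Share of entries per IPC version.
version_share = clean['ipc_version_indicator'].value_counts(normalize=True)
print(version_share)

# Assumed reading of "mutually exclusive": at most one level per patent.
levels_per_patent = (clean.groupby('patent_id')['classification_level']
                     .nunique())
is_exclusive = bool((levels_per_patent <= 1).all())
print('classification_level exclusive per patent:', is_exclusive)
```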
