Process Patent Attributes Into a Dataset

Note that the Google online patent dataset has a better, more extensive version of this data. If this approach works, it may make sense to switch to their dataset, which includes more detailed information on the type of references and probably a better version of the claims with better priority information.

The reason I didn't use this dataset now is that it is much more structured because it contains all patents over all time, so there is a much higher upfront cost to working with it. The best way to work with it is to first identify the attributes that matter and choose the algorithm, then download the data.

Another outcome variable I'd like is patent renewal, but I wasn't able to find that directly from the USPTO website. Looking at the Falk and Train paper, the best data source for this is "the “INPADOC Legal Status Code” and the “US Post Issuance” fields in
Thomson Innovation’s database, which describes official updates to the status of a patent. The update codes are
country-specific, so expired takes on a “1” if the codes “FP”, “FPB1”, “FPB2”, “FPB3”, or “LAPS” are reported in the
“INPADOC Legal Status Code” field or if “EXPI” appears in the “US Post Issuance” field. For expiration dates, this
variable relies on the “INPADOC Legal Status Date” and the “US Post Issuance” fields in Thomson Innovation’s
database. Date information from the “US Post Issuance” field takes precedence if there is a discrepancy between
the two fields. This is because the “US Post-Issuance” field reflects data from the USPTO, updated weekly. See:
http://www.thomsoninnovation.com/tip-innovation/support/help/patent_fields.htm#inpadoc_legal_status ;
http://www.thomsoninnovation.com/tip-innovation/support/help/legalstatus_codes/lsc_us.htm;
http://www.thomsoninnovation.com/tip-innovation/support/help/patent_fields.htm#post_issuance"
This quote is taken from footnote 12 on page 5 of "Patent Valuation with Forecasts of Forward Citations".
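
The rule quoted above can be sketched in a few lines. This is only an illustration of the quoted definition, not code I have run against Thomson Innovation data: the argument names `inpadoc_codes`, `us_post_issuance`, `inpadoc_date` and `us_post_issuance_date` are hypothetical stand-ins for the fields named in the footnote, and the input shapes are assumptions.

```python
# Codes that mark a US patent as expired, per the quoted footnote.
EXPIRED_INPADOC_CODES = {"FP", "FPB1", "FPB2", "FPB3", "LAPS"}


def is_expired(inpadoc_codes, us_post_issuance):
    """Return 1 if the patent counts as expired under the quoted rule, else 0.

    inpadoc_codes: iterable of legal status code strings (assumed shape)
    us_post_issuance: text of the "US Post Issuance" field, or None
    """
    has_code = any(code in EXPIRED_INPADOC_CODES for code in (inpadoc_codes or []))
    has_expi = "EXPI" in (us_post_issuance or "")
    return int(has_code or has_expi)


def expiration_date(inpadoc_date, us_post_issuance_date):
    """Prefer the US Post Issuance date whenever it is available."""
    if us_post_issuance_date is not None:
        return us_post_issuance_date
    return inpadoc_date
```

Preferring the "US Post Issuance" date whenever it exists is one way to implement "takes precedence if there is a discrepancy"; it also covers the case where only one of the two dates is present.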

## Additions to the Dataset
* add in inventor age
    * maybe add in the inventor age at the patent invention date using the inventor or assignee dataset
* clean lawyer organization and use that as a categorical

## Dataset Problems
* some patents that should have citations made are missing them - I have flagged these as missing
* some assignment types are empty

In [1]:
import concurrent.futures
import pickle
import json
import funcy
import csv
import os
import gzip
import pandas as pd
import numpy as np

In [2]:
# Split description directory
patents_dataset_dir = '/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips'
patents_dfs_dir = '/nobackup1/lraymond/patent_data/numerical_patents_datasets/dataframes'

# I should be storing these files in the Lustre parallel file system

# I want to stripe this directory over 20 servers to optimize performance 
# /nobackup1/lraymond/patent_data/numerical_patents_datasets
# contains gzip files patents_year_1979.gz to patents_year_2013.gz


In [3]:
def write_large_pickle_file(df, filepath):
    max_bytes = 2 ** 31 - 1
    with open(filepath, 'wb') as f:
        bytes_out = pickle.dumps(df)
        for idx in range(0, len(bytes_out), max_bytes):
            f.write(bytes_out[idx:idx + max_bytes])


def read_large_pickle_file(filepath):
    max_bytes = 2 ** 31 - 1
    bytes_in = bytearray(0)
    input_size = os.path.getsize(filepath)
    with open(filepath, 'rb') as f_in:
        for _ in range(0, input_size, max_bytes):
            bytes_in += f_in.read(max_bytes)

    data2 = pickle.loads(bytes_in)
    return data2
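
The two helpers above split the pickle bytes into pieces of at most `2 ** 31 - 1` bytes, presumably to avoid single reads or writes larger than 2 GiB. A minimal sketch of the same idea with the chunk size exposed as a parameter (the names `write_pickle_in_chunks`, `read_pickle_in_chunks` and `chunk_size` are mine, not part of the notebook) makes it possible to exercise the loops on a small object:

```python
import os
import pickle
import tempfile


def write_pickle_in_chunks(obj, filepath, chunk_size=2 ** 31 - 1):
    # Serialize once, then write the bytes in slices of at most chunk_size.
    payload = pickle.dumps(obj)
    with open(filepath, 'wb') as f:
        for start in range(0, len(payload), chunk_size):
            f.write(payload[start:start + chunk_size])


def read_pickle_in_chunks(filepath, chunk_size=2 ** 31 - 1):
    # Read the file back in slices of at most chunk_size, then deserialize.
    buffer = bytearray()
    with open(filepath, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buffer += chunk
    return pickle.loads(bytes(buffer))


# Round trip with a deliberately tiny chunk size so both loops run several times.
sample = {'patent_number': '4567890', '5_year_cites': 3}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.p')
    write_pickle_in_chunks(sample, path, chunk_size=16)
    restored = read_pickle_in_chunks(path, chunk_size=16)
```

Note that `pickle.dumps` still builds the whole byte string in memory, so this approach limits the size of each I/O call but not peak memory use.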

In [4]:
def read_decompress_json(year, patents_dir):
    jsonfilename = os.path.join(patents_dir, 'patents_year_{}.gz'.format(str(year)))
    print(jsonfilename)
    with gzip.GzipFile(jsonfilename, 'r') as fin:
        data = json.loads(fin.read().decode('utf-8'))
    return data


In [5]:
def is_US_utility_patent(patent_dict):
    # filter function for patent sample
    if patent_dict['patent_type'] != 'utility':
        return False
    if patent_dict['patent_firstnamed_inventor_country'] != 'US':
        return False
    assignee_country = patent_dict['patent_firstnamed_assignee_country']
    if assignee_country == 'US' or assignee_country is None:
        return True
    return False

In [6]:
def yield_select_patents(year, patents_dir=patents_dataset_dir):
    # each yearly file is gzipped json: data['year'] holds the year number and
    # data['patent_data'] holds a list of batches
    # each batch is itself a list with the first item being the sequence, the last being a number and
    # the middle being a list of patent attribute dictionaries
    # batches contain 1000 patents each, so len(data['patent_data'][x][1]) is 1000
    data = read_decompress_json(year, patents_dir)
    print(data['year'])
    for item in data['patent_data']:
        # if a utility patent and US only, yield
        _, d, _ = item
        for patent_dict in d:
            is_sample = is_US_utility_patent(patent_dict)
            if is_sample:
                yield patent_dict
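
To make the file layout described in the comments concrete, here is a small sketch that writes a toy yearly file and filters it the same way. Everything in `toy_year` is invented for illustration (the patent numbers, the `0` and `3` placeholders around the batch), and `keep_patent` is a compact restatement of the sample filter rather than the function used above.

```python
import gzip
import json
import os
import tempfile

# Hypothetical miniature of one yearly file: each entry of 'patent_data' is a
# three-item list whose middle item is a list of patent dictionaries.
toy_year = {
    'year': 1986,
    'patent_data': [
        [0, [
            {'patent_number': 'A', 'patent_type': 'utility',
             'patent_firstnamed_inventor_country': 'US',
             'patent_firstnamed_assignee_country': None},
            {'patent_number': 'B', 'patent_type': 'design',
             'patent_firstnamed_inventor_country': 'US',
             'patent_firstnamed_assignee_country': 'US'},
            {'patent_number': 'C', 'patent_type': 'utility',
             'patent_firstnamed_inventor_country': 'US',
             'patent_firstnamed_assignee_country': 'JP'},
        ], 3],
    ],
}


def keep_patent(p):
    # US utility patents whose first-named assignee is in the US or missing.
    return (p['patent_type'] == 'utility'
            and p['patent_firstnamed_inventor_country'] == 'US'
            and p['patent_firstnamed_assignee_country'] in ('US', None))


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'patents_year_1986.gz')
    with gzip.open(path, 'wt', encoding='utf-8') as fout:
        json.dump(toy_year, fout)
    with gzip.open(path, 'rt', encoding='utf-8') as fin:
        data = json.load(fin)

# Walk every batch and keep the patent numbers that pass the filter.
kept = [p['patent_number']
        for _, batch, _ in data['patent_data']
        for p in batch if keep_patent(p)]
```

Only patent `A` survives: `B` is a design patent and `C` has a foreign first-named assignee.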

In [7]:
def count_cites_received(patent_dict, year_range):
    yrs = pd.DateOffset(years=year_range)
    grant_date = pd.to_datetime(patent_dict['patent_date'], format='%Y-%m-%d', errors='coerce')
    max_date = grant_date + yrs
    # assuming second item is citation_date
    cite_dates = list(map(
        lambda x: pd.to_datetime(x[1],format='%Y-%m-%d', errors='coerce'), patent_dict['citations_received']))
    return sum(map(lambda x: x <= max_date, cite_dates))
   
    
def is_dict_all_none(input_dict):
    # check if dictionary is totally empty
    non_null_vals = list(filter(None, input_dict.values()))
    return (len(non_null_vals)==0)


def filter_null_dicts(list_dicts):
    # lots of empty filler values in the dictionary attributes so need to filter these out
    _, nonempty = funcy.split(is_dict_all_none, list_dicts)
    nonempty = list(nonempty)
    if len(nonempty) > 0:
        return nonempty
    return None
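
A worked example of the citation window may help. The sketch below mirrors the counting logic with a standalone helper (`cites_within` is a hypothetical name, and the dates are made up) so the boundary behaviour is visible: a citation dated exactly on the cutoff is counted, and a date that fails to parse becomes `NaT` and is never counted.

```python
import pandas as pd


def cites_within(grant_date, cite_dates, years):
    # Count citations dated on or before grant_date + years.
    # Unparseable dates become NaT, and comparisons with NaT are False,
    # so they drop out of the count.
    grant = pd.to_datetime(grant_date, format='%Y-%m-%d', errors='coerce')
    cutoff = grant + pd.DateOffset(years=years)
    parsed = [pd.to_datetime(d, format='%Y-%m-%d', errors='coerce') for d in cite_dates]
    return sum(bool(d <= cutoff) for d in parsed)


grant_date = '1986-03-04'
cite_dates = ['1988-01-05', '1991-03-04', '1991-03-05', '1995-07-11', '1990-06-00']
five_year = cites_within(grant_date, cite_dates, 5)   # cutoff 1991-03-04
ten_year = cites_within(grant_date, cite_dates, 10)   # cutoff 1996-03-04
```

Here `five_year` is 2 and `ten_year` is 4. The citation dated `1990-06-00` is excluded from both, which means partial dates in the source data silently lower the counts.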
    

In [21]:
def process_forprior_dict(new1, list_dicts):
    # for lack of a better option, use the attributes of the first entry
    new1['number_forprior'] = len(list_dicts)
    new1['forprior_country'] = list_dicts[0]['forprior_country']
    new1['forprior_date'] = list_dicts[0]['forprior_date']    
    return new1
   

def process_gov_dict(new1, list_dicts):
    new1['number_govint'] = len(list_dicts)
    new1['govint_contract_award_number'] = list_dicts[0]['govint_contract_award_number']
    return new1


def process_app_dict(new1, list_dicts):
    new1['number_apps'] = len(list_dicts)
    new1['app_number'] = list_dicts[0]['app_number']
    new1['app_id'] = list_dicts[0]['app_id']
    # app_type 02 through 28 = Utility application
    new1['app_type'] = list_dicts[0]['app_type']
    new1['app_date'] = list_dicts[0]['app_date']
    return new1


def process_nber_dict(new1, list_dicts):
    new1['number_nbers'] = len(list_dicts)
    new1['nber_category_id'] = list_dicts[0]['nber_category_id']
    new1['nber_category_title'] = list_dicts[0]['nber_category_title']
    new1['nber_subcategory_id'] = list_dicts[0]['nber_subcategory_id']
    new1['nber_subcategory_title'] = list_dicts[0]['nber_subcategory_title']
    return new1


def sort_by_sequence(list_dicts, key_name):
    types_list = sorted(list_dicts, key=lambda x: x['{}_sequence'.format(key_name)], reverse=False)
    return types_list[0]


def convert_assignee_type(assignee_val):
    '''
    Classification of assignee. 2 - US Company or Corporation, 
    3 - Foreign Company or Corporation, 4 - US Individual, 5 - Foreign Individual,
    6 - US Government, 7 - Foreign Government, 8 - Country Government,
    9 - State Government (US). Note: A "1" appearing before any of these codes signifies part interest
    '''
    # returns a tuple: (assignee type label, is_company, is_gov, is_individual)
    if pd.isnull(assignee_val):
        return (np.nan, np.nan, np.nan, np.nan)
    try:
        if assignee_val.startswith('1'):
            # if a part interest, categorize that part interest
            int_val = int(assignee_val[-1])
        else:
            int_val = int(assignee_val)
    except ValueError as e:
        # empty or non-numeric codes are treated as missing
        print(e, assignee_val)
        int_val = 10
    # no finally block here: returning from finally would swallow unexpected exceptions
    ASSIGNEE_DICT = {
        2: 'US Company',
        3: 'Foreign Company',
        4: 'US Individual',
        5: 'Foreign Individual',
        6: 'US Government',
        7: 'Foreign Government',
        8: 'US Government',
        9: 'US Government',
        10: 'Missing'
        # TODO: add 14
    }
    mapped_val = ASSIGNEE_DICT.get(int_val, None)
    # explicit membership tests so that unmapped codes (e.g. 0 or 1) set no flag
    is_company = int(int_val in (2, 3))
    is_gov = int(int_val in (6, 7, 8, 9))
    is_ind = int(int_val in (4, 5))
    return (mapped_val, is_company, is_gov, is_ind)

In [9]:
def process_assignee_dict(new1, list_dicts):
    new1['number_assignees_sequence'] = max(map(lambda x: x['assignee_sequence'], list_dicts))
    new1['number_assignees'] = len(list_dicts)
    # sort dictionaries by sequence
    first = sort_by_sequence(list_dicts, 'assignee')
    raw_assignee = first['assignee_type']
    str_val, is_comp, is_gov, is_ind = convert_assignee_type(raw_assignee)
    new1['assignee_type'] = str_val
    new1['assignee_is_company'] = is_comp
    new1['assignee_is_gov'] = is_gov
    new1['assignee_is_ind'] = is_ind
    new1['assignee_ids']  = list(filter(None, map(lambda x: x['assignee_id'], list_dicts)))
    return new1

def process_inventor_dict(new1, list_dicts):
    first = sort_by_sequence(list_dicts, 'inventor')
    new1['inventor_total_num_patents']  = first['inventor_total_num_patents']
    new1['number_inventors_sequence'] = max(map(lambda x: x['inventor_sequence'], list_dicts))
    new1['number_inventors'] = len(list_dicts)
    new1['inventor_ids']  = list(filter(None, map(lambda x: x['inventor_id'], list_dicts)))
    return new1

def process_lawyers_dict(new1, list_dicts):
    first = sort_by_sequence(list_dicts, 'lawyer')
    new1['lawyer_total_num_assignees']  = first['lawyer_total_num_assignees']
    new1['lawyer_total_num_inventors']  = first['lawyer_total_num_inventors']
    new1['lawyer_total_num_patents']  = first['lawyer_total_num_patents']
    new1['lawyer_organization']  = first['lawyer_organization']
    new1['number_lawyers_sequence'] = max(map(lambda x: x['lawyer_sequence'], list_dicts))
    new1['number_lawyers'] = len(list_dicts)
    new1['lawyer_ids']  = list(filter(None, map(lambda x: x['lawyer_id'], list_dicts)))
    return new1

def process_examiners_dict(new1, list_dicts):
    # examiners have roles - primary or assistant
    new1['number_examiners'] = len(list_dicts)
    new1['number_primary_examiners'] = len(list(filter(lambda x: x['examiner_role']=='primary', list_dicts)))
    new1['number_assistant_examiners'] = len(list(filter(lambda x: x['examiner_role']=='assistant', list_dicts)))
    new1['number_other_examiners'] = len(list(filter(
        lambda x: x['examiner_role'] not in ('assistant', 'primary'), list_dicts)))
    new1['examiner_ids']  = list(filter(None, map(lambda x: x['examiner_id'], list_dicts)))
    return new1


def process_cited_dict(new1, list_dicts):
    cited_nums = list(map(
        lambda x: x['cited_patent_number'], list_dicts))
    cited_dates  = list(map(
        lambda x: x['cited_patent_date'], list_dicts))
    cited_titles = list(map(
        lambda x: x['cited_patent_title'], list_dicts)) 
    new1['citations_made'] = list(zip(cited_nums, cited_dates, cited_titles))
    return new1


def process_citing_dict(new1, list_dicts):
    # these are patents that cite this specific patent in the future
    citing_nums = list(map(
        lambda x: x['citedby_patent_number'], list_dicts))
    citing_dates  = list(map(
        lambda x: x['citedby_patent_date'], list_dicts))
    # these are 'cited by applicant', 'cited by other', None
    citing_cats = list(map(
        lambda x: x['citedby_patent_category'], list_dicts))
    citing_titles = list(map(
        lambda x: x['citedby_patent_title'], list_dicts))
   
    new1['citations_received'] = list(zip(citing_nums, citing_dates, citing_titles, citing_cats))
    new1['20_year_cites'] = count_cites_received(new1, 20)
    new1['5_year_cites'] = count_cites_received(new1, 5)
    new1['10_year_cites'] = count_cites_received(new1, 10)
    new1['15_year_cites'] = count_cites_received(new1, 15)  
    new1['30_year_cites'] = count_cites_received(new1, 30)
    return new1

In [10]:
def subset_patent_dict(raw_dict):
    '''
    patent type -  Category of patent. There are 6 possible types:
    "Defensive Publication" - 509, "Design" - 474736, "Plant" - 21052, "Reissue" - 16416, 
    "Statutory Invention Registration" - 2254, "Utility" - 4910906.
    patent kind - should all be utility
    '''
    keys_to_keep = (
        # patent date is grant date
         'patent_abstract', 'patent_date',
        # assignee info
        'patent_firstnamed_assignee_city',  'patent_firstnamed_assignee_country', 
        'patent_firstnamed_assignee_id', 'patent_firstnamed_assignee_location_id', 
        'patent_firstnamed_assignee_state', 
        # inventor
        'patent_firstnamed_inventor_city', 'patent_firstnamed_inventor_country', 
        'patent_firstnamed_inventor_id', 'patent_firstnamed_inventor_location_id',
        'patent_firstnamed_inventor_state',
        
        'patent_kind', 
        # these are precomputed citation counts -
        # e.g. patent_num_us_patent_citations is the number of US patents cited by the selected patent
        'patent_num_cited_by_us_patents', 'patent_num_combined_citations', 
        'patent_num_foreign_citations', 'patent_num_us_application_citations', 
        'patent_num_us_patent_citations',
        'patent_number',
       'patent_title', 'patent_type', 'patent_year'
    )
    # filter dictionary by keys
    new1 = funcy.select_keys(lambda x: x in keys_to_keep, raw_dict)
    # foreign prior information
    
    forprior_dict = filter_null_dicts(raw_dict['foreign_priority'])
    if forprior_dict is not None:
        new1 = process_forprior_dict(new1, forprior_dict)
    
    # government interest
    gov_dict = filter_null_dicts(raw_dict['gov_interests'])
    if gov_dict is not None:
        new1 = process_gov_dict(new1, gov_dict)
    
    # application information  - no info on how app number and app id differ
    app_dict = filter_null_dicts(raw_dict['applications'])
    if app_dict is not None:
        new1 = process_app_dict(new1, app_dict)
    
    # nber category and subcategory
    nber_dict = filter_null_dicts(raw_dict['nbers'])
    if nber_dict is not None:
        new1 = process_nber_dict(new1, nber_dict)
  
    # assignees
    assignee_dict = filter_null_dicts(raw_dict['assignees'])
    if assignee_dict is not None:
        new1 = process_assignee_dict(new1, assignee_dict) 

    # inventor info
    inventor_dict = filter_null_dicts(raw_dict['inventors'])
    if inventor_dict is not None:
        new1 = process_inventor_dict(new1, inventor_dict) 
        
    # lawyer info
    lawyers_dict = filter_null_dicts(raw_dict['lawyers'])
    if lawyers_dict is not None:
        new1 = process_lawyers_dict(new1, lawyers_dict) 
        
    # examiner info
    examiners_dict = filter_null_dicts(raw_dict['examiners'])
    if examiners_dict is not None:
        new1 = process_examiners_dict(new1, examiners_dict) 
  
    # citations info - these are backwards citations
    cited_dict = filter_null_dicts(raw_dict['cited_patents'])
    if cited_dict is not None:
        new1 = process_cited_dict(new1, cited_dict)
        #a bunch of patents appear to be missing citations made 
        new1['missing_citations_made'] = 0
    else:
        new1['missing_citations_made'] = 1

    # citing patents info
    citing_dict = filter_null_dicts(raw_dict['citedby_patents'])
    if citing_dict is not None:
        new1 = process_citing_dict(new1, citing_dict)
    return new1

    

In [11]:
def generate_cites_rank(df):
    cites_cols = [c for c in df.columns if c.endswith('_year_cites')]
    print(cites_cols)
    for colname in cites_cols:
        yr_rank_name = '{}_rank'.format(colname)
        yr_top1_name = '{}_top1'.format(colname)
        df[yr_rank_name] = df[colname].rank(pct=True)
        df[yr_top1_name] = 0
        mask = df[yr_rank_name]>.99
        df.loc[mask, yr_top1_name] = 1
    return df
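
The percentile ranking can be checked on toy data. The sketch below uses invented citation counts for 150 hypothetical patents; it applies the same `rank(pct=True)` call and `> .99` threshold as above. Because `fetch_subset_all_patents` builds one DataFrame per year, the ranks in this notebook are computed within each grant year.

```python
import pandas as pd

# Toy citation counts for 150 hypothetical patents: 0, 1, ..., 149.
toy = pd.DataFrame({'5_year_cites': range(150)})
toy['5_year_cites_rank'] = toy['5_year_cites'].rank(pct=True)
toy['5_year_cites_top1'] = (toy['5_year_cites_rank'] > .99).astype(int)

# Number of patents flagged as top 1 percent.
n_top = int(toy['5_year_cites_top1'].sum())

# Ties share the average rank, so a column that is mostly zeros
# puts every zero-citation patent at the same percentile.
tied = pd.Series([0, 0, 0, 5]).rank(pct=True).tolist()
```

With 150 distinct values, two patents clear the threshold. In the tied example the three zero-citation patents all receive a percentile of 0.5, which is worth keeping in mind for short citation windows where most counts are zero.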

In [12]:
def get_lat_long(lat_long_string, pos):
    if pd.isnull(lat_long_string):
        return np.nan
    return lat_long_string.split('|')[pos]

def convert_dates(x):
    if pd.isnull(x):
        return pd.NaT
    try:
        return pd.to_datetime(x, format='%Y-%m-%d').date()
    except ValueError as e:
        print(e, x)
        return x

def process_patent_dataframe(df):
    '''
    Clean the per-year patent DataFrame: add a foreign priority flag, split the
    location ids into latitude and longitude, cast count columns to int,
    parse the date columns, and flag missing abstracts and titles.
    '''
    precompiled_cites = ['patent_num_cited_by_us_patents', 'patent_num_combined_citations', 
                         'patent_num_foreign_citations', 'patent_num_us_application_citations',
                         'patent_num_us_patent_citations']

    int_cols = ['10_year_cites', '15_year_cites', '20_year_cites', '30_year_cites',
       '5_year_cites',  'inventor_total_num_patents',  'lawyer_total_num_assignees', 'lawyer_total_num_inventors',
       'lawyer_total_num_patents', 'number_apps',
       'number_assignees', 'number_assignees_sequence', 'nber_category_id', 'nber_subcategory_id',
       'number_assistant_examiners', 'number_examiners', 'number_forprior',
       'number_govint', 'number_inventors', 'number_inventors_sequence',
       'number_lawyers', 'number_lawyers_sequence', 'number_nbers',
       'number_other_examiners', 'number_primary_examiners', 'missing_citations_made', 'patent_year']
    # create a flag for whether the patent has a foreign priority claim
    df['flag_has_forprior'] = df.forprior_country.notnull().astype(int)
    # convert lat long into different fields
    df['patent_firstnamed_assignee_latitude'] = df.patent_firstnamed_assignee_location_id.apply(
        lambda x: get_lat_long(x, 0)).astype(float)
    df['patent_firstnamed_assignee_longitude'] = df.patent_firstnamed_assignee_location_id.apply(
        lambda x: get_lat_long(x, 1)).astype(float)
    df['patent_firstnamed_inventor_latitude'] = df.patent_firstnamed_inventor_location_id.apply(
        lambda x: get_lat_long(x, 0)).astype(float)
    df['patent_firstnamed_inventor_longitude'] = df.patent_firstnamed_inventor_location_id.apply(
        lambda x: get_lat_long(x, 1)).astype(float)
    # convert integer columns to ints
    df[int_cols+precompiled_cites] = df[int_cols+precompiled_cites].fillna(0).astype(int)
    
    # convert three date columns
    df[[
        'patent_date', 'forprior_date', 'app_date']]  = df[[
            'patent_date', 'forprior_date', 'app_date']].applymap(convert_dates)
    
    # drop patent_firstnamed_inventor_location_id, patent_firstnamed_assignee_location_id
    del df['patent_firstnamed_assignee_location_id']
    del df['patent_firstnamed_inventor_location_id']
    
    # add some missing variables for abstract and title
    df['missing_patent_abstract'] = df.patent_abstract.apply(pd.isnull).astype(int)
    df['missing_patent_title'] = df.patent_title.apply(pd.isnull).astype(int)
    return df
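
The latitude and longitude split above relies on the location id being a `latitude|longitude` string. A vectorised sketch of the same split follows; the coordinates are made-up examples and the `location_ids` Series is a stand-in for the real columns.

```python
import pandas as pd

# Hypothetical location ids in the 'latitude|longitude' form the split assumes.
location_ids = pd.Series(['42.3601|-71.0589', None, '37.7749|-122.4194'])

# expand=True yields one column per piece; rows with a missing id stay missing.
parts = location_ids.str.split('|', expand=True)

# to_numeric converts the string pieces to floats and keeps missing values as NaN.
latitude = pd.to_numeric(parts[0])
longitude = pd.to_numeric(parts[1])
```

This gives the same result as applying `get_lat_long` row by row, without the per-row Python function call.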

In [13]:
def fetch_subset_all_patents(year, df_dir=patents_dfs_dir):
    filename = os.path.join(patents_dataset_dir, 'patents_year_{}.gz'.format(str(year)))
    if not os.path.exists(filename):
        print('Filename ', filename, ' does not exist')
        return None
    print('starting {}'.format(str(year)))
    subset_dicts = list(map(subset_patent_dict, yield_select_patents(year, patents_dataset_dir)))
    # check patent length
    print(len(subset_dicts))
    ranked_df = generate_cites_rank(pd.DataFrame(subset_dicts))
    processed_df = process_patent_dataframe(ranked_df)  
    df_filename = os.path.join(df_dir, 'patent_df_{}.csv'.format(str(year)))
    print('saving to ', df_filename)
    # saving to csv because pickle size limited 
    processed_df.to_csv(df_filename, index=False)
    df_filename = os.path.join(df_dir, 'patent_df_{}.p'.format(str(year)))
    write_large_pickle_file(processed_df, df_filename)
#     return processed_df[['10_year_cites', '10_year_cites_top1', '10_year_cites_rank', 
#                         '5_year_cites', '5_year_cites_top1', '5_year_cites_rank', 
#                          'patent_number', 'patent_year', 
#                          'missing_citations_made', 'missing_patent_abstract', 
#                         'missing_patent_title']]
    
    

In [14]:
# this kills my memory
# pool = concurrent.futures.ProcessPoolExecutor(max_workers=6)

# W = pd.concat(pool.map(fetch_subset_all_patents, range(1985, 1990)), axis=0, join='outer', ignore_index=True)

In [22]:
%%time 
list(map(fetch_subset_all_patents, range(1986, 1990)))

starting 1986
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_1986.gz
1986
invalid literal for int() with base 10: '' 
37869
['10_year_cites', '15_year_cites', '20_year_cites', '30_year_cites', '5_year_cites']
saving to  /nobackup1/lraymond/patent_data/numerical_patents_datasets/dataframes/patent_df_1986.csv
starting 1987
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_1987.gz
1987
43189
['10_year_cites', '15_year_cites', '20_year_cites', '30_year_cites', '5_year_cites']
time data '1984-06-00' doesn't match format specified 1984-06-00
time data '1985-10-00' doesn't match format specified 1985-10-00
saving to  /nobackup1/lraymond/patent_data/numerical_patents_datasets/dataframes/patent_df_1987.csv
starting 1988
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_1988.gz
1988
40212
['10_year_cites', '15_year_cites', '20_year_cites', '30_year_cites', '5_year_cites']
time data '1986-10-00' doesn't match f

[None, None, None, None]
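
The `time data ... doesn't match format` messages in the output above come from dates with a day of `00`, such as `1984-06-00`, which `convert_dates` leaves as raw strings. One possible way to handle them is sketched below. It assumes that `00` means "unknown" and that snapping it to `01` is acceptable for the analysis; that is my assumption, not something the source data documents, and `parse_partial_date` is a hypothetical helper.

```python
import pandas as pd


def parse_partial_date(value):
    # Assumption: a month or day of '00' means "unknown" and may be
    # replaced by '01'. Anything else that fails to parse becomes NaT.
    if pd.isnull(value):
        return pd.NaT
    parts = str(value).split('-')
    if len(parts) != 3:
        return pd.NaT
    year, month, day = parts
    month = '01' if month == '00' else month
    day = '01' if day == '00' else day
    parsed = pd.to_datetime('-'.join((year, month, day)),
                            format='%Y-%m-%d', errors='coerce')
    return parsed if pd.isnull(parsed) else parsed.date()
```

For example, `parse_partial_date('1984-06-00')` gives the date 1984-06-01, while a fully specified date such as `'1986-03-04'` is returned unchanged as a `date` object.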

In [23]:
%%time 
list(map(fetch_subset_all_patents, range(1980, 1985)))

starting 1980
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_1980.gz
1980
invalid literal for int() with base 10: '' 
invalid literal for int() with base 10: '' 
37058
['10_year_cites', '15_year_cites', '20_year_cites', '30_year_cites', '5_year_cites']
saving to  /nobackup1/lraymond/patent_data/numerical_patents_datasets/dataframes/patent_df_1980.csv
starting 1981
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_1981.gz
38916
['10_year_cites', '15_year_cites', '20_year_cites', '30_year_cites', '5_year_cites']
saving to  /nobackup1/lraymond/patent_data/numerical_patents_datasets/dataframes/patent_df_1981.csv
starting 1982
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_1982.gz
1982
invalid literal for int() with base 10: '' 
33560
['10_year_cites', '15_year_cites', '20_year_cites', '30_year_cites', '5_year_cites']
saving to  /nobackup1/lraymond/patent_data/numerical_patents_datasets/dataframes/pate

[None, None, None, None, None]

In [24]:
%%time 
list(map(fetch_subset_all_patents, range(1990, 2000)))

starting 1990
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_1990.gz
1990
42356
['10_year_cites', '15_year_cites', '20_year_cites', '30_year_cites', '5_year_cites']
saving to  /nobackup1/lraymond/patent_data/numerical_patents_datasets/dataframes/patent_df_1990.csv
starting 1991
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_1991.gz
1991
50751
['10_year_cites', '15_year_cites', '20_year_cites', '30_year_cites', '5_year_cites']
saving to  /nobackup1/lraymond/patent_data/numerical_patents_datasets/dataframes/patent_df_1991.csv
starting 1992
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_1992.gz
1992
52717
['10_year_cites', '15_year_cites', '20_year_cites', '30_year_cites', '5_year_cites']
time data '1991-02-00' doesn't match format specified 1991-02-00
saving to  /nobackup1/lraymond/patent_data/numerical_patents_datasets/dataframes/patent_df_1992.csv
Filename  /nobackup1/lraymond/patent_data/numer

[None, None, None, None, None, None, None, None, None, None]

In [26]:
%%time 
list(map(fetch_subset_all_patents, range(1999, 2010)))

starting 1999
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_1999.gz
1999
83626
['10_year_cites', '15_year_cites', '20_year_cites', '30_year_cites', '5_year_cites']
saving to  /nobackup1/lraymond/patent_data/numerical_patents_datasets/dataframes/patent_df_1999.csv
starting 2000
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_2000.gz
2000
83783
['10_year_cites', '15_year_cites', '20_year_cites', '30_year_cites', '5_year_cites']
saving to  /nobackup1/lraymond/patent_data/numerical_patents_datasets/dataframes/patent_df_2000.csv
starting 2001
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_2001.gz
2001
85939
['10_year_cites', '15_year_cites', '20_year_cites', '30_year_cites', '5_year_cites']
saving to  /nobackup1/lraymond/patent_data/numerical_patents_datasets/dataframes/patent_df_2001.csv
starting 2002
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_2002.gz
2002
85829
[

[None, None, None, None, None, None, None, None, None, None, None]

In [27]:
list(map(fetch_subset_all_patents, range(2010, 2014)))

starting 2010
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_2010.gz
2010
108037
['10_year_cites', '15_year_cites', '20_year_cites', '30_year_cites', '5_year_cites']
saving to  /nobackup1/lraymond/patent_data/numerical_patents_datasets/dataframes/patent_df_2010.csv
starting 2011
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_2011.gz
2011
90163
['10_year_cites', '15_year_cites', '20_year_cites', '30_year_cites', '5_year_cites']
saving to  /nobackup1/lraymond/patent_data/numerical_patents_datasets/dataframes/patent_df_2011.csv
starting 2012
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_2012.gz
2012
89919
['10_year_cites', '15_year_cites', '20_year_cites', '30_year_cites', '5_year_cites']
saving to  /nobackup1/lraymond/patent_data/numerical_patents_datasets/dataframes/patent_df_2012.csv
starting 2013
/nobackup1/lraymond/patent_data/numerical_patents_datasets/gzips/patents_year_2013.gz
2013
89400


[None, None, None, None]

In [None]:
#%%time
#W = pd.concat(map(fetch_subset_all_patents, range(1985, 1990)), axis=0, join='outer', ignore_index=True)

In [None]:
 # save_zipped_pickle(W, index_filename)