## Pulling from the PatentsView API

This notebook walks through pulling one year of data from the PatentsView API, restricted to patents whose first-named assignee and first-named inventor are both located in the US.

### Import necessary packages.

In [None]:
# interacting with websites and web-APIs
import requests 

# data manipulation
import pandas as pd 

# normalize nested JSON files
# (available at the top level since pandas 1.0; the pandas.io.json import path is deprecated)
from pandas import json_normalize

### API query for patents.

This cell queries the PatentsView **patents** endpoint for one year of data, one month at a time (12 requests), keeping only patents whose first-named assignee and first-named inventor are both from the US.
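
The query described above can also be assembled from Python objects instead of one long string. The sketch below is only an illustration: `month_query_params` is a hypothetical helper, not part of the original notebook. It zero-pads the month and uses the real last day of each month, whereas string concatenation with a fixed `-31` yields dates such as `2018-2-31`, and this notebook does not establish how the API treats those.

```python
import calendar
import json


def month_query_params(year, month, fields, page=1, per_page=10000):
    """Build the q/f/o parameters for one calendar month (hypothetical helper)."""
    # monthrange returns (weekday of the first day, number of days in the month)
    last_day = calendar.monthrange(year, month)[1]
    start = f"{year}-{month:02d}-01"
    end = f"{year}-{month:02d}-{last_day:02d}"
    query = {"_and": [
        {"_gte": {"patent_date": start}},
        {"_lte": {"patent_date": end}},
        {"patent_firstnamed_assignee_country": "US"},
        {"patent_firstnamed_inventor_country": "US"},
    ]}
    return {
        "q": json.dumps(query),
        "f": json.dumps(fields),
        "o": json.dumps({"page": page, "per_page": per_page}),
    }


params = month_query_params(2018, 2, ["patent_number", "patent_date"])
print(params["q"])
```

A dictionary like this could be passed as `requests.get('https://www.patentsview.org/api/patents/query', params=params)`, which lets `requests` handle the URL encoding.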

In [None]:
results = []
for i in range(1,13):  # iterate over 12 months
    # Change year in this string when pulling for another year
    url = 'https://www.patentsview.org/api/patents/query?q={"_and":[{"_gte":{"patent_date":"2018-' + str(i) + '-01"}},{"_lte":{"patent_date":"2018-' + str(i) + '-31"}},{"patent_firstnamed_assignee_country":"US"},{"patent_firstnamed_inventor_country":"US"}]}&f=["patent_number","patent_title","patent_abstract","patent_num_cited_by_us_patents","patent_date","app_date","patent_firstnamed_inventor_id","patent_firstnamed_inventor_city","patent_firstnamed_inventor_state","patent_firstnamed_inventor_latitude","patent_firstnamed_inventor_longitude","patent_firstnamed_assignee_id","patent_firstnamed_assignee_city","patent_firstnamed_assignee_state","patent_firstnamed_assignee_latitude","patent_firstnamed_assignee_longitude"]&o={"page":1,"per_page":10000}'
    r = requests.get(url)
    response = r.json()

    # If the month has at most 10,000 patents, everything fits on the first page
    # (per_page is 10,000), so we can build the dataframe from this response alone:
    
    if response['total_patent_count'] <= 10000:  
        patents = pd.DataFrame(response['patents']).drop(columns=['applications'])  # only unpack patent path, drop applications column (the application information is saved in the next line) 
        application = json_normalize(response['patents'], record_path='applications', meta='patent_number')  # unpack applications table
        df = patents.merge(application, on='patent_number')  # now merge two tables together (patents and application information)
        results.append(df)  # append results to the list
    
    # If the month has more than 10,000 but at most 20,000 patents, build the dataframe for
    # page 1 as before, then request page 2, build its dataframe, and append it to the
    # page 1 results:
    
    elif 10000 < response['total_patent_count'] <= 20000:
        # Get page 1 results.
        patents = pd.DataFrame(response['patents']).drop(columns=['applications'])
        application = json_normalize(response['patents'], record_path='applications', meta='patent_number')
        df_1 = patents.merge(application, on='patent_number')
        
        # Get page 2 results.
        # Change year in this string to match the queried year above.
        url_2 = 'https://www.patentsview.org/api/patents/query?q={"_and":[{"_gte":{"patent_date":"2018-' + str(i) + '-01"}},{"_lte":{"patent_date":"2018-' + str(i) + '-31"}},{"patent_firstnamed_assignee_country":"US"},{"patent_firstnamed_inventor_country":"US"}]}&f=["patent_number","patent_title","patent_abstract","patent_num_cited_by_us_patents","patent_date","app_date","patent_firstnamed_inventor_id","patent_firstnamed_inventor_city","patent_firstnamed_inventor_state","patent_firstnamed_inventor_latitude","patent_firstnamed_inventor_longitude","patent_firstnamed_assignee_id","patent_firstnamed_assignee_city","patent_firstnamed_assignee_state","patent_firstnamed_assignee_latitude","patent_firstnamed_assignee_longitude"]&o={"page":2,"per_page":10000}'
        r_2 = requests.get(url_2)
        response_2 = r_2.json()
        patents_2 = pd.DataFrame(response_2['patents']).drop(columns=['applications'])
        application_2 = json_normalize(response_2['patents'], record_path='applications', meta='patent_number')
        df_2 = patents_2.merge(application_2, on='patent_number')
        
        df_concat = pd.concat([df_1,df_2])  # concatenate tables from 2 pages of results
        results.append(df_concat)
        
    # Otherwise the month has more than 20,000 patents. Print a message so we can see
    # whether we actually need to request more than 2 pages.
    
    else:
        print(i,"there are more results")
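
The flattening step in the cell above (drop the nested `applications` column, unpack it separately, then merge the two tables on `patent_number`) can be tried offline on a small example. The records below are made up and only mimic the nested shape the cell expects; the `app_id` field and all values are assumptions for illustration.

```python
import pandas as pd

# Made-up records shaped like the 'patents' list in the API response
sample_patents = [
    {"patent_number": "1000001", "patent_title": "Widget",
     "applications": [{"app_date": "2016-03-01", "app_id": "15/000001"}]},
    {"patent_number": "1000002", "patent_title": "Gadget",
     "applications": [{"app_date": "2017-07-15", "app_id": "15/000002"}]},
]

# Patent-level columns only
patents = pd.DataFrame(sample_patents).drop(columns=["applications"])

# One row per nested application, carrying patent_number along as the key
application = pd.json_normalize(
    sample_patents, record_path="applications", meta=["patent_number"]
)

# Rejoin the two tables on the shared key
df_flat = patents.merge(application, on="patent_number")
print(df_flat)
```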

### Concatenate results into one final dataframe.
The code above saved each monthly dataframe in the list `results`. We can now combine the entries into one dataframe.

In [None]:
# Concatenate final results per year. 
patents_2018 = pd.concat(results)

# Check to drop duplicates
patents_2018 = patents_2018.drop_duplicates()

In [None]:
# Check that all patent dates are from the same year
patents_2018['patent_date'].unique()

In [None]:
# Sort by patent number and check the data
patents_2018.sort_values(by=['patent_number'], inplace=True)
patents_2018.head()
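
Listing every unique date works, but a full year can contain dozens of grant dates. As a possible shortcut, the dates can be parsed and reduced to their year. The sketch below uses a made-up stand-in for the `patent_date` column and assumes the dates are ISO-formatted strings.

```python
import pandas as pd

# Made-up stand-in for patents_2018['patent_date']
sample = pd.DataFrame({"patent_date": ["2018-01-02", "2018-06-19", "2018-12-25"]})

# Parse the strings and keep only the year component
years = pd.to_datetime(sample["patent_date"]).dt.year
print(years.unique())
```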

### API query for inventor names.

Filter by the same criteria as for our patent dataset above (year and location of first-named inventor and assignee) but for the **inventors** endpoint. The query follows the same logic as above.

In [None]:
inventor_names = []
for i in range(1,13):  # iterate over 12 months
    # Change year in this string when pulling for another year
    url = 'https://www.patentsview.org/api/inventors/query?q={"_and":[{"_gte":{"patent_date":"2018-' + str(i) + '-01"}},{"_lte":{"patent_date":"2018-' + str(i) + '-31"}},{"patent_firstnamed_assignee_country":"US"},{"patent_firstnamed_inventor_country":"US"}]}&f=["inventor_id","inventor_first_name","inventor_last_name"]&o={"page":1,"per_page":10000}'
    r = requests.get(url)
    response = r.json()

    # if the response has at most 10,000 elements, one page is enough:
    if response['total_inventor_count'] <= 10000:  
        inventor_names.append(pd.DataFrame(response['inventors']))
    
    # else if the response has more than 10,000 and at most 20,000 elements, get 2 pages:
    elif 10000 < response['total_inventor_count'] <= 20000:
        # Get page 1 results.
        df_1 = pd.DataFrame(response['inventors'])
        
        # Get page 2 results.
        url_2 = 'https://www.patentsview.org/api/inventors/query?q={"_and":[{"_gte":{"patent_date":"2018-' + str(i) + '-01"}},{"_lte":{"patent_date":"2018-' + str(i) + '-31"}},{"patent_firstnamed_assignee_country":"US"},{"patent_firstnamed_inventor_country":"US"}]}&f=["inventor_id","inventor_first_name","inventor_last_name"]&o={"page":2,"per_page":10000}'
        r_2 = requests.get(url_2)
        response_2 = r_2.json()
        
        df_2 = pd.DataFrame(response_2['inventors'])
        
        df_concat = pd.concat([df_1,df_2])  # concatenate tables from 2 pages of results
        inventor_names.append(df_concat)
    
    # else if the response has more than 20,000 and at most 30,000 elements, get 3 pages:
    elif 20000 < response['total_inventor_count'] <= 30000:
        # Get page 1 results.
        df_1 = pd.DataFrame(response['inventors'])
        
        # Get page 2 results.
        url_2 = 'https://www.patentsview.org/api/inventors/query?q={"_and":[{"_gte":{"patent_date":"2018-' + str(i) + '-01"}},{"_lte":{"patent_date":"2018-' + str(i) + '-31"}},{"patent_firstnamed_assignee_country":"US"},{"patent_firstnamed_inventor_country":"US"}]}&f=["inventor_id","inventor_first_name","inventor_last_name"]&o={"page":2,"per_page":10000}'
        r_2 = requests.get(url_2)
        response_2 = r_2.json()
        df_2 = pd.DataFrame(response_2['inventors'])
        
        # Get page 3 results.
        url_3 = 'https://www.patentsview.org/api/inventors/query?q={"_and":[{"_gte":{"patent_date":"2018-' + str(i) + '-01"}},{"_lte":{"patent_date":"2018-' + str(i) + '-31"}},{"patent_firstnamed_assignee_country":"US"},{"patent_firstnamed_inventor_country":"US"}]}&f=["inventor_id","inventor_first_name","inventor_last_name"]&o={"page":3,"per_page":10000}'
        r_3 = requests.get(url_3)
        response_3 = r_3.json()
        df_3 = pd.DataFrame(response_3['inventors'])
        
        df_concat = pd.concat([df_1,df_2,df_3])  # concatenate tables from 3 pages of results
        inventor_names.append(df_concat)
    
    # else if the response has more than 30,000 and at most 40,000 elements, get 4 pages:
    elif 30000 < response['total_inventor_count'] <= 40000:
        # Get page 1 results.
        df_1 = pd.DataFrame(response['inventors'])
        
        # Get page 2 results.
        url_2 = 'https://www.patentsview.org/api/inventors/query?q={"_and":[{"_gte":{"patent_date":"2018-' + str(i) + '-01"}},{"_lte":{"patent_date":"2018-' + str(i) + '-31"}},{"patent_firstnamed_assignee_country":"US"},{"patent_firstnamed_inventor_country":"US"}]}&f=["inventor_id","inventor_first_name","inventor_last_name"]&o={"page":2,"per_page":10000}'
        r_2 = requests.get(url_2)
        response_2 = r_2.json()
        df_2 = pd.DataFrame(response_2['inventors'])
        
        # Get page 3 results.
        url_3 = 'https://www.patentsview.org/api/inventors/query?q={"_and":[{"_gte":{"patent_date":"2018-' + str(i) + '-01"}},{"_lte":{"patent_date":"2018-' + str(i) + '-31"}},{"patent_firstnamed_assignee_country":"US"},{"patent_firstnamed_inventor_country":"US"}]}&f=["inventor_id","inventor_first_name","inventor_last_name"]&o={"page":3,"per_page":10000}'
        r_3 = requests.get(url_3)
        response_3 = r_3.json()
        df_3 = pd.DataFrame(response_3['inventors'])
        
        # Get page 4 results.
        url_4 = 'https://www.patentsview.org/api/inventors/query?q={"_and":[{"_gte":{"patent_date":"2018-' + str(i) + '-01"}},{"_lte":{"patent_date":"2018-' + str(i) + '-31"}},{"patent_firstnamed_assignee_country":"US"},{"patent_firstnamed_inventor_country":"US"}]}&f=["inventor_id","inventor_first_name","inventor_last_name"]&o={"page":4,"per_page":10000}'
        r_4 = requests.get(url_4)
        response_4 = r_4.json()
        df_4 = pd.DataFrame(response_4['inventors'])
        
        df_concat = pd.concat([df_1,df_2,df_3,df_4]) 
        inventor_names.append(df_concat)
        
    else:
        print(i,"there are more results")
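
The if/elif ladder above needs one more branch for every extra page. One alternative is to compute the number of pages from the reported total and loop over them. The sketch below is a minimal illustration, not the notebook's method: `fetch_all_pages` and `fetch_page` are hypothetical names, and a fake in-memory API stands in for the real requests so the example runs offline.

```python
import math

import pandas as pd


def fetch_all_pages(fetch_page, total_key, entity_key, per_page=10000):
    """Collect every page of one query into a single dataframe.

    fetch_page is a hypothetical callable: page number -> parsed JSON dict.
    """
    first = fetch_page(1)
    # Number of pages needed to cover the reported total (at least one)
    n_pages = max(1, math.ceil(first[total_key] / per_page))
    frames = [pd.DataFrame(first[entity_key] or [])]
    for page in range(2, n_pages + 1):
        frames.append(pd.DataFrame(fetch_page(page)[entity_key] or []))
    return pd.concat(frames, ignore_index=True)


# Offline stand-in for the API: 25 fake inventors served 10 per page
fake = [{"inventor_id": str(n)} for n in range(25)]


def fake_fetch(page, per_page=10):
    rows = fake[(page - 1) * per_page: page * per_page]
    return {"total_inventor_count": len(fake), "inventors": rows}


all_inventors = fetch_all_pages(
    fake_fetch, "total_inventor_count", "inventors", per_page=10
)
print(len(all_inventors))
```

With the real API, `fetch_page` could wrap `requests.get(...).json()` and place the page number in the `o` parameter.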

In [None]:
# Combine all inventor names and ids in one dataframe
inventor_names_2018 = pd.concat(inventor_names)

### API query for assignee names.

Filter by the same criteria as for our patent dataset above (year and location of first-named inventor and assignee) but for the **assignees** endpoint. The query follows the same logic as above.

In [None]:
assignee_names = []
for i in range(1,13):  # iterate over 12 months
    # Change year in this string when pulling for another year
    url = 'https://www.patentsview.org/api/assignees/query?q={"_and":[{"_gte":{"patent_date":"2018-' + str(i) + '-01"}},{"_lte":{"patent_date":"2018-' + str(i) + '-31"}},{"patent_firstnamed_assignee_country":"US"},{"patent_firstnamed_inventor_country":"US"}]}&f=["assignee_id","assignee_organization"]&o={"page":1,"per_page":10000}'
    r = requests.get(url)
    response = r.json()

    # if the response has at most 10,000 elements, one page is enough:
    if response['total_assignee_count'] <= 10000:  
        assignee_names.append(pd.DataFrame(response['assignees']))
    
    # else if the response has more than 10,000 and at most 20,000 elements, get 2 pages:
    elif 10000 < response['total_assignee_count'] <= 20000:
        # Get page 1 results.
        df_1 = pd.DataFrame(response['assignees'])
        
        # Get page 2 results.
        url_2 = 'https://www.patentsview.org/api/assignees/query?q={"_and":[{"_gte":{"patent_date":"2018-' + str(i) + '-01"}},{"_lte":{"patent_date":"2018-' + str(i) + '-31"}},{"patent_firstnamed_assignee_country":"US"},{"patent_firstnamed_inventor_country":"US"}]}&f=["assignee_id","assignee_organization"]&o={"page":2,"per_page":10000}'
        r_2 = requests.get(url_2)
        response_2 = r_2.json()
        
        df_2 = pd.DataFrame(response_2['assignees'])
        
        df_concat = pd.concat([df_1,df_2])  # concatenate tables from 2 pages of results
        assignee_names.append(df_concat)
        
    # in other cases, if there are more than 20,000 response elements, print this:
    
    else:
        print(i,"there are more results")

In [None]:
# Combine all assignee names and ids in one dataframe
assignee_names_2018 = pd.concat(assignee_names)

Now we have three datasets: one with the patent information, one with inventor names, and one with assignee names. We can combine them by merging on the inventor and assignee ids.

In [None]:
# Rename ids so the merge is easier
patents_2018 = patents_2018.rename(columns={'patent_firstnamed_inventor_id': 'inventor_id',
                                            'patent_firstnamed_assignee_id': 'assignee_id'})

In [None]:
# Merge the names to the patent file
patents_2018_with_names = patents_2018.merge(inventor_names_2018, on='inventor_id').merge(assignee_names_2018,on='assignee_id').drop_duplicates()
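
Because the name tables are pulled month by month, the same id can appear in them several times, and an inner merge drops any patent whose id has no match. The sketch below shows one possible variant on made-up data: de-duplicate the lookup table first, then do a left merge. The left join is a design choice that differs from the cell above, and `validate` is used only to guard against accidental row multiplication.

```python
import pandas as pd

# Made-up patent table with the renamed id column
patents = pd.DataFrame({
    "patent_number": ["1", "2", "3"],
    "inventor_id": ["a", "a", "b"],
})

# The same inventor can come back in several monthly pulls
inventor_names = pd.DataFrame({
    "inventor_id": ["a", "a", "b"],
    "inventor_last_name": ["Lee", "Lee", "Ng"],
})

# Keep one row per id so each patent matches at most one name
lookup = inventor_names.drop_duplicates(subset="inventor_id")

# Left join keeps every patent; validate raises if the lookup ids are not unique
merged = patents.merge(lookup, on="inventor_id", how="left", validate="many_to_one")
print(merged)
```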

In [None]:
# Check the merged dataframe
patents_2018_with_names.tail()

In [None]:
# What columns do we have, and how many non-null values does each contain?
patents_2018_with_names.count()

In [None]:
# Rename our name columns so it is clear that it is the name of the first named inventor/assignee
patents_2018_with_names = (patents_2018_with_names.rename(columns={'inventor_first_name':'patent_firstnamed_inventor_name_first',
                                                                   'inventor_last_name':'patent_firstnamed_inventor_name_last',
                                                                   'assignee_organization':'patent_firstnamed_assignee_organization'} ))

In [None]:
# Include those columns which are needed for the final dataset and in the order that makes sense
patents_2018_finalized = patents_2018_with_names[['patent_number', 'patent_title', 'patent_abstract',
                           'patent_num_cited_by_us_patents', 'patent_date', 'app_date','patent_firstnamed_inventor_name_first',
                            'patent_firstnamed_inventor_name_last','patent_firstnamed_inventor_city','patent_firstnamed_inventor_state',
                           'patent_firstnamed_inventor_latitude','patent_firstnamed_inventor_longitude','patent_firstnamed_assignee_organization',
                            'patent_firstnamed_assignee_city', 'patent_firstnamed_assignee_state', 'patent_firstnamed_assignee_latitude',
                           'patent_firstnamed_assignee_longitude']]

In [None]:
# Check the number of unique patents after merging on names
patents_2018_finalized['patent_number'].nunique()

In [None]:
# Save to a CSV file
patents_2018_finalized.to_csv('patents_2018.csv', index=False)

Now we have our data for 2018. To cover more years, we can either repeat the process above for each year or wrap all the code in an additional loop so that it runs once per year.
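
The outer loop over years could look roughly like the sketch below. It assumes the per-year steps have been wrapped in a function; `pull_years` and `pull_one_year` are hypothetical names, and a fake pull function is used here so the example runs without the API.

```python
import pandas as pd


def pull_years(years, pull_one_year):
    """Run a one-year pull for each year and stack the results.

    pull_one_year is a hypothetical callable: year -> dataframe.
    """
    frames = []
    for year in years:
        df_year = pull_one_year(year)
        # Tag each row with the year it was pulled for
        frames.append(df_year.assign(year=year))
    return pd.concat(frames, ignore_index=True)


# Offline stand-in for the real per-year pull
def fake_pull(year):
    return pd.DataFrame({"patent_number": [f"{year}-1", f"{year}-2"]})


all_years = pull_years(range(2016, 2019), fake_pull)
print(all_years)
```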

### Bonus: Query using a for loop

You can also use a loop to query the inventor or assignee names one id at a time. For example, our patents_2018 data contains all the inventor ids, so we can fetch the name for each inventor that appears in the patent data. Looping through the list of unique ids takes more time because it sends one request per id, but it is workable when the number of ids is not too large.

In [None]:
# get inventor names using a loop
# (patents_2018 holds the full year; its id columns were renamed above for the merge)
inventors = []
for inventor_id in patents_2018['inventor_id'].unique():
    url = 'https://www.patentsview.org/api/inventors/query?q={"inventor_id":"' + str(inventor_id) + '"}&f=["inventor_id","inventor_first_name","inventor_last_name"]'
    r = requests.get(url)
    response = r.json()
    temp = json_normalize(response['inventors'])
    inventors.append(temp)

In [None]:
# get assignee names using a loop
# (patents_2018 holds the full year; its id columns were renamed above for the merge)
assignees = []
for assignee_id in patents_2018['assignee_id'].unique():
    url = 'https://www.patentsview.org/api/assignees/query?q={"assignee_id":"' + str(assignee_id) + '"}&f=["assignee_id","assignee_organization"]'
    r = requests.get(url)
    response = r.json()
    temp = json_normalize(response['assignees'])
    assignees.append(temp)
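
One request per id can be slow. A possible speed-up is to send several ids per request. The sketch below only builds the query strings and makes an assumption that should be checked against the PatentsView query language documentation: that a list of values for a field is treated as "any of these values". `batched_id_queries` is a hypothetical helper, and the batch size is kept small because long GET URLs may be rejected.

```python
import json


def batched_id_queries(ids, field, batch_size=100):
    """Yield one q string per batch of ids (hypothetical helper)."""
    ids = list(ids)
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # Assumption: a list value matches records with any of the listed ids
        yield json.dumps({field: batch})


# Made-up ids, two per batch
queries = list(batched_id_queries(["a1", "b2", "c3"], "inventor_id", batch_size=2))
print(queries)
```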