# Data preprocessing

In [1]:
import pandas as pd
import numpy as np

claim_path = "../data/out_claim.zip"
title_path = "../data/out_title.zip"
description_path = "../data/out_descr.zip"

Load all the data into the dataframe `df` by joining the separate dataframes (claims, titles, descriptions) into a single one, indexed by patent number.

In [101]:
df = pd.read_csv(claim_path, compression="zip") \
         .drop(columns=["Unnamed: 0", "Language", "PatenType", "PublicationType", "Part", "Number"]) \
         .rename(columns={"Contents": "claims"}).rename(str.lower, axis="columns") \
         .set_index("patentnumber")

In [102]:
title = pd.read_csv(title_path, compression="zip") \
          .drop(columns=["Unnamed: 0", "Language", "PatenType", "PublicationType", "Part", "Number", "Date"]) \
          .rename(columns={"Contents": "title"}).rename(str.lower, axis="columns") \
          .set_index("patentnumber")

df = df.join(title)
del title

In [103]:
desc = pd.read_csv(description_path, compression="zip") \
          .drop(columns=["Unnamed: 0", "Language", "PatenType", "PublicationType", "Part", "Number", "Date"]) \
          .rename(columns={"Contents": "description"}).rename(str.lower, axis="columns") \
          .set_index("patentnumber")

df = df.join(desc)
del desc
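
`DataFrame.join` aligns on the index and defaults to a left join, so a patent that appears in the claims file but is missing from the titles or descriptions file ends up with `NaN` in that column. The toy frames below (invented patent numbers and text, not taken from the dataset) sketch this behaviour and show why a `dropna()` afterwards is a reasonable safeguard.

```python
import pandas as pd

# Toy data: the patent numbers and texts are invented for illustration
claims = pd.DataFrame(
    {"patentnumber": [1, 2, 3], "claims": ["c1", "c2", "c3"]}
).set_index("patentnumber")
titles = pd.DataFrame(
    {"patentnumber": [1, 3], "title": ["t1", "t3"]}
).set_index("patentnumber")

# Left join on the shared index: patent 2 has no title, so it gets NaN
joined = claims.join(titles)
print(joined)

# Dropping incomplete rows keeps only patents present in both frames
complete = joined.dropna()
print(list(complete.index))  # [1, 3]
```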

In [104]:
df.head()

patentnumber,date,claims,title,description
3000006,2018-02-28,['A complementary metal oxide semiconductor vo...,"ALL-CMOS, LOW-VOLTAGE, WIDE-TEMPERATURE RANGE,...","{'BACKGROUND OF THE INVENTION': [], 'FIELD OF ..."
3000007,2020-07-08,['A method for configuring a user interface of...,SYSTEM AND METHOD FOR OPTIMIZED APPLIANCE CONTROL,"{'BACKGROUND': ['Controlling devices, for exam..."
3000011,2017-05-03,['A method (400) of positioning one or more vi...,BODY-LOCKED PLACEMENT OF AUGMENTED REALITY OBJ...,{'BACKGROUND': ['An augmented reality computin...
3000012,2019-05-01,['A method of displaying a schedule in a weara...,METHOD AND APPARATUS FOR DISPLAYING SCHEDULE O...,{'Technical Field': ['The present disclosure r...
3000013,2020-05-06,['A remote controller adapted to interact with...,INTERACTIVE MULTI-TOUCH REMOTE CONTROL,{'BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF TH...


In [105]:
df.shape

(43182, 4)

In [106]:
df = df.dropna()
df.shape

(43182, 4)

The dataset contains many documents, and each document holds a large amount of text. We therefore process the descriptions in batches, so that the pipeline can run on any machine regardless of its installed memory.
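
One way to batch the work is to walk over the rows in fixed-size slices, so that only one slice of parsed text is held in memory at a time. The helper below is a minimal sketch; the name `iter_batches` and the batch size are assumptions, not part of the original notebook.

```python
from typing import Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the description column: any sliceable sequence works
descriptions = [f"description {i}" for i in range(7)]
batch_sizes = [len(batch) for batch in iter_batches(descriptions, batch_size=3)]
print(batch_sizes)  # [3, 3, 1]
```

With a dataframe, the same loop can be written over `df.iloc[start:start + batch_size]`.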

## Summary extraction

Let's try to extract summaries from all documents first.

In [107]:
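import ast

# parse each stringified dict once and keep the sections whose heading mentions "summary"
df["summaries"] = df["description"].apply(
    lambda r: [v for k, v in ast.literal_eval(r).items() if 'summary' in k.lower()],
    convert_dtype=False)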
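
Each description is stored as the string form of a dictionary that maps section headings to lists of paragraphs, as the `df.head()` output suggests. The sketch below parses one such string with `ast.literal_eval` and keeps the sections whose heading mentions a summary; the sample text and the helper name `summary_sections` are made up for illustration.

```python
import ast

# Invented sample in the same shape as a `description` cell
raw = "{'BACKGROUND': ['Prior art.'], 'SUMMARY': ['First paragraph.', 'Second paragraph.']}"


def summary_sections(raw_description: str) -> list:
    """Return the paragraph lists of all sections whose heading contains 'summary'."""
    sections = ast.literal_eval(raw_description)  # parses literals only, runs no code
    return [paragraphs for heading, paragraphs in sections.items()
            if 'summary' in heading.lower()]


found = summary_sections(raw)
print(len(found))          # 1
print(' '.join(found[0]))  # First paragraph. Second paragraph.
```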

In [112]:
# keep only the patents whose description has exactly one summary section
df = df[df['summaries'].apply(len) == 1]

In [142]:
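# join the paragraphs of the single summary into one string;
# the isinstance guard makes the cell safe to re-run, since indexing an
# already-joined empty string raises "IndexError: string index out of range"
df['summaries'] = df['summaries'].apply(
    lambda r: ' '.join(r[0]) if isinstance(r, list) else r)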

In [144]:
import ast

# lowercased heading of the first section whose name mentions "summary"
df['summary_title'] = df["description"].apply(
    lambda r: [k.lower() for k in ast.literal_eval(r) if 'summary' in k.lower()][0],
    convert_dtype=False)

In [174]:
import re
from functools import reduce
from nltk.corpus import stopwords


REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
GOOD_SYMBOLS_RE = re.compile('[^0-9a-z ]')

def replace_special_characters(text: str) -> str:
    """
    Replaces special characters, such as parentheses,
    with a space character
    """
    return REPLACE_BY_SPACE_RE.sub(' ', text)

def filter_out_uncommon_symbols(text: str) -> str:
    """
    Removes any special character that is not in the
    good symbols list (check regular expression)
    """
    return GOOD_SYMBOLS_RE.sub('', text)

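def filter_out_stopwords(text: str) -> str:
    """
    Removes English stopwords (requires the NLTK 'stopwords' corpus).
    """
    # build the set once per call instead of reloading the word list for every token
    english_stopwords = set(stopwords.words('english'))
    return ' '.join(w for w in text.split() if w not in english_stopwords)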

def strip_text(text: str) -> str:
    """
    Removes any left or right spacing (including carriage return) from text.
    Example:
    Input: '  This assignment is cool\n'
    Output: 'This assignment is cool'
    """
    return text.strip()

PREPROCESSING_PIPELINE = [
  replace_special_characters,
  filter_out_uncommon_symbols,
  strip_text
]

# Anchor method

def text_prepare(text: str,
                 filter_methods = None) -> str:
    """
    Applies a list of pre-processing functions in sequence (reduce).
    Note that the order is important here!
    """
    filter_methods = filter_methods if filter_methods is not None else PREPROCESSING_PIPELINE
    return reduce(lambda txt, f: f(txt), filter_methods, text)
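
As a quick check of how the steps compose, the sketch below re-declares the same three steps in a self-contained form and runs them on a made-up sentence. Because `GOOD_SYMBOLS_RE` only keeps digits, lowercase letters and spaces, uppercase letters are dropped unless the text is lowercased first; the `str.lower` step in the second call is an addition for illustration, not part of the pipeline above, and the helper name `run` is hypothetical.

```python
import re
from functools import reduce

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
GOOD_SYMBOLS_RE = re.compile('[^0-9a-z ]')

pipeline = [
    lambda text: REPLACE_BY_SPACE_RE.sub(' ', text),  # special characters -> space
    lambda text: GOOD_SYMBOLS_RE.sub('', text),       # drop anything but 0-9, a-z, space
    str.strip,                                        # trim surrounding whitespace
]


def run(text, steps):
    """Apply each step in order, feeding the output of one into the next."""
    return reduce(lambda txt, f: f(txt), steps, text)


sample = "A Method (400), as claimed."
without_lowercasing = run(sample, pipeline)
with_lowercasing = run(sample, [str.lower] + pipeline)
print(repr(without_lowercasing))  # capital letters are lost
print(repr(with_lowercasing))     # capital letters survive as lowercase
```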

In [155]:
good_headings = ['summary']

In [172]:
df = df[df['summary_title'].isin(good_headings)]

## Dataset creation

In [183]:
dataset = df[['summaries', 'claims']].copy()  # explicit copy avoids SettingWithCopyWarning on the assignments below

In [184]:
import ast

# keep only the first claim of each patent
dataset['claims'] = dataset['claims'].apply(lambda r: ast.literal_eval(r)[0])
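
Selecting columns from one frame and then assigning into the result is the pattern that pandas flags with `SettingWithCopyWarning`, because it cannot tell whether the assignment is meant to reach the original frame. A minimal sketch with toy data (invented values and variable names) of taking an explicit copy, after which the source frame stays untouched:

```python
import pandas as pd

# Toy stand-in for `df`; the values are invented
source = pd.DataFrame({
    "summaries": ["S one", "S two"],
    "claims": ["['first claim', 'second claim']", "['only claim']"],
    "title": ["T1", "T2"],
})

# .copy() makes `subset` an independent frame, so column assignment is unambiguous
subset = source[["summaries", "claims"]].copy()
subset["claims"] = subset["claims"].str.upper()

print(source["claims"].tolist())  # unchanged
print(subset["claims"].tolist())  # upper-cased copy
```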


In [185]:
import re
from functools import reduce

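# matches a parenthesised reference number such as "(400)";
# a separate name keeps REPLACE_BY_SPACE_RE from the earlier cell intact
REFERENCE_NUMBER_RE = re.compile(r'\(\d+\)')

def filter_out_numbers(text: str) -> str:
    return REFERENCE_NUMBER_RE.sub('', text)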

def lower_text(text: str) -> str:
    return text.lower()

def strip_text(text: str) -> str:
    return text.strip()

# separate name so that text_prepare keeps its own default PREPROCESSING_PIPELINE
DATASET_PREPROCESSING_PIPELINE = [
  strip_text,
  lower_text,
  filter_out_numbers,
]

# Anchor method

def dataset_text_prepare(text: str,
                         filter_methods = None) -> str:
    """
    Applies a list of pre-processing functions in sequence (reduce).
    Note that the order is important here!
    """
    filter_methods = filter_methods if filter_methods is not None else DATASET_PREPROCESSING_PIPELINE
    return reduce(lambda txt, f: f(txt), filter_methods, text)
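
To illustrate what this clean-up does, the sketch below re-declares the same three steps (strip, lowercase, remove parenthesised reference numbers) in a self-contained form and applies them to two made-up claim fragments. The pattern only matches a single number in parentheses, so a list such as `(200, 300)` is left untouched, and removing `(400)` leaves a double space behind. The helper name `prepare` is hypothetical.

```python
import re
from functools import reduce

REFERENCE_NUMBER_RE = re.compile(r'\(\d+\)')

steps = [
    str.strip,                                        # trim surrounding whitespace
    str.lower,                                        # lowercase everything
    lambda text: REFERENCE_NUMBER_RE.sub('', text),   # drop "(400)"-style references
]


def prepare(text: str) -> str:
    """Run the steps in order, feeding each result into the next step."""
    return reduce(lambda txt, f: f(txt), steps, text)


single = prepare("  A method (400) of positioning virtual objects\n")
multiple = prepare("A combustor dome assembly (200, 300)")
print(repr(single))
print(repr(multiple))
```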

In [186]:
dataset['summaries'] = dataset['summaries'].apply(dataset_text_prepare)
dataset['claims'] = dataset['claims'].apply(dataset_text_prepare)


In [187]:
dataset

patentnumber,summaries,claims
3000011,embodiments are disclosed that relate to posit...,a method of positioning one or more virtual o...
3000020,embodiments of the present technology relate t...,a system for presenting a mixed reality exper...
3000024,"a ""just in time"" or as-needed feedback-driven ...",a system comprising:a computing device with a...
3000026,the invention provides a computer-implemented ...,"a computer-implemented method, comprising:dete..."
3000033,this summary introduces selected concepts in s...,a computer-implemented process performed by a ...
...,...,...
3099954,it is an aim of the present disclosure to prov...,a magnetorheological fluid clutch apparatus c...
3099961,it is an object of the present technology to a...,a method of operating a vehicle at different a...
3099975,"a wall assembly for a gas turbine engine, acco...","a wall assembly for a gas turbine engine , co..."
3099979,"in a first, general aspect, a gas turbine comb...","a gas turbine combustor dome assembly (200, 30..."


In [153]:
import json
# NOTE: `titles` is not defined in any of the cells shown above
with open('title.json', 'w') as f:
    json.dump(list(titles), f)