# Data preprocessing

In [273]:
import pandas as pd
import numpy as np

claim_path = "../data/out/out_claim.csv"
title_path = "../data/out/out_title.csv"
description_path = "../data/out/out_descr.csv"

Load all the data into the dataframe `df`, joining the separate dataframes (claims, titles, descriptions) into a single dataframe indexed by patent number.

In [25]:
df = pd.read_csv(claim_path) \
         .drop(columns=["Unnamed: 0", "Language", "PatenType", "PublicationType", "Part", "Number"]) \
         .rename(columns={"Contents": "claims"}).rename(str.lower, axis="columns") \
         .set_index("patentnumber")

In [26]:
title = pd.read_csv(title_path) \
          .drop(columns=["Unnamed: 0", "Language", "PatenType", "PublicationType", "Part", "Number", "Date"]) \
          .rename(columns={"Contents": "title"}).rename(str.lower, axis="columns") \
          .set_index("patentnumber")

df = df.join(title)
del title

In [30]:
df.head()

patentnumber,date,claims,title
3000006,2018-02-28,['A complementary metal oxide semiconductor vo...,"ALL-CMOS, LOW-VOLTAGE, WIDE-TEMPERATURE RANGE,..."
3000007,2020-07-08,['A method for configuring a user interface of...,SYSTEM AND METHOD FOR OPTIMIZED APPLIANCE CONTROL
3000011,2017-05-03,['A method (400) of positioning one or more vi...,BODY-LOCKED PLACEMENT OF AUGMENTED REALITY OBJ...
3000012,2019-05-01,['A method of displaying a schedule in a weara...,METHOD AND APPARATUS FOR DISPLAYING SCHEDULE O...
3000013,2020-05-06,['A remote controller adapted to interact with...,INTERACTIVE MULTI-TOUCH REMOTE CONTROL


In [31]:
df.shape

(43114, 3)

We have 43,114 documents to process, and each one contains a large amount of text. We will therefore process the descriptions in batches, so that the data can be handled on any machine regardless of how much memory is installed.
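A minimal sketch of the batching idea, using an in-memory CSV as a hypothetical stand-in for the descriptions file (the row contents and the chunk size of 4 are made up for illustration). Passing `chunksize` to `pd.read_csv` returns an iterator of dataframes, so only one batch is held in memory at a time:

```python
import io
import pandas as pd

# Hypothetical stand-in for the descriptions file: 10 tiny rows.
csv_text = "PatentNumber,Contents\n" + "".join(
    f"{3000000 + i},text {i}\n" for i in range(10)
)

# With `chunksize`, read_csv yields DataFrames of at most that many rows
# instead of loading the whole file at once.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

chunk_sizes = [len(chunk) for chunk in reader]
print(chunk_sizes)  # [4, 4, 2]
```

The last batch is simply shorter when the number of rows is not a multiple of the chunk size.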

## Summary extraction

Let's try to extract summaries from all documents first.

In [295]:
# read only the first 1000 documents, 100 at a time
reader = pd.read_csv(description_path, nrows=1000, chunksize=100)

In [296]:
row = reader.get_chunk(100)

In [297]:
import ast

def filter_out_without_summary(chunk):
  # work on a copy so the caller's chunk is left untouched
  chunk = chunk.copy()

  # parse each row once and extract the summary if present, else None
  chunk["Contents"] = chunk["Contents"].apply(
    lambda r: ast.literal_eval(r).get("SUMMARY"))

  # drop all documents with no summary
  chunk = chunk.dropna(subset=["Contents"]).copy()

  # merge all summary paragraphs into a single string
  chunk["Contents"] = chunk["Contents"].apply(lambda t: " ".join(t))

  return chunk

In [299]:
for chunk in reader:
  # pass the current chunk, not the reader itself
  with_summary = filter_out_without_summary(chunk)
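As a hedged sketch of how the per-chunk results could be kept rather than overwritten on each iteration, the example below runs the same idea on a hypothetical four-row file. The toy data, the helper `summary_or_none`, and the use of `ast.literal_eval` are assumptions made for illustration, not part of the original pipeline:

```python
import ast
import io
import pandas as pd

# Hypothetical miniature of the descriptions file: `Contents` holds the
# string form of a dict mapping section names to lists of paragraphs.
toy = pd.DataFrame({
    "PatentNumber": [1, 2, 3, 4],
    "Contents": [
        str({"SUMMARY": ["First part.", "Second part."]}),
        str({"BACKGROUND": ["No summary here."]}),
        str({"SUMMARY": ["Only part."]}),
        str({"CLAIMS": ["Also no summary."]}),
    ],
})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

def summary_or_none(raw):
    # Parse the dict literal and join the summary paragraphs, if any.
    sections = ast.literal_eval(raw)
    parts = sections.get("SUMMARY")
    return " ".join(parts) if parts is not None else None

pieces = []
for chunk in pd.read_csv(buffer, chunksize=2):
    summaries = chunk["Contents"].apply(summary_or_none)
    # Replace the raw text with the summary and drop rows without one.
    chunk = chunk.assign(Contents=summaries).dropna(subset=["Contents"])
    pieces.append(chunk)

# One dataframe holding only the documents that have a summary.
with_summary = pd.concat(pieces)
print(with_summary["Contents"].tolist())
# ['First part. Second part.', 'Only part.']
```

Only the filtered rows of each batch are kept in `pieces`, so memory use grows with the number of summaries rather than with the size of the full descriptions file.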

In [292]:
idx["Contents"]

0     Certain embodiments relate to a system for dis...
23    The object of the present disclosure is to pro...
30    According to an embodiment of the present inve...
35    Disclosed, in various embodiments, are energy ...
36    The invention is defined by the features of th...
40    This Summary is provided to introduce, in a si...
42    Various implementations of systems, methods an...
43    The Summary is provided to introduce a selecti...
55    In accordance with the present invention, prod...
60    In order to improve flux and quality of the ne...
66    An object is to improve the conservation of en...
67    An object of the present invention is thus to ...
68    There is presented a method for detecting a re...
76    The problems underlying the present invention ...
77    In order to improve flux and quality of the ne...
78    This disclosure is based on the unexpected dis...
82    An object is to enable improved user experienc...
88    A method is implemented by a network devic

In [199]:
row["Contents"].apply(lambda r: eval(r)["SUMMARY"][0] if "SUMMARY" in eval(r).keys() else None, convert_dtype=False)

SyntaxError: invalid syntax (<string>, line 1)
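The `SyntaxError` indicates that at least one `Contents` value is not a valid Python expression. One possible way to tolerate such rows is sketched below; the helper name `parse_sections` and the two sample strings are hypothetical, and `ast.literal_eval` is swapped in for `eval` on the assumption that every valid value is a plain dict literal:

```python
import ast

def parse_sections(raw):
    """Return the parsed dict, or None if `raw` is not a valid dict literal."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Malformed or non-string input: treat as "no sections available".
        return None
    return parsed if isinstance(parsed, dict) else None

good = "{'SUMMARY': ['A method is provided.']}"
bad = "{'SUMMARY': ['truncated"  # unterminated literal, cannot be parsed

print(parse_sections(good))
print(parse_sections(bad))
```

Rows for which the helper returns `None` can then be dropped in the same way as rows without a summary.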

In [109]:
x = map(extract_summary, reader)

In [110]:
list(x)

[]

In [57]:
"summary" in map(str.lower, eval(x.iloc[0]["Contents"]).keys())

True
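The check above lowercases the keys before testing for `"summary"`, which suggests that the casing of section names may vary between documents (an assumption; the data shown here does not confirm it). A minimal sketch of a case-insensitive lookup, with the hypothetical helper `get_section` and made-up sample data:

```python
def get_section(sections, name):
    """Return the value whose key matches `name` ignoring case, else None."""
    wanted = name.lower()
    for key, value in sections.items():
        if key.lower() == wanted:
            return value
    return None

sections = {"Summary": ["An object is to save energy."], "BACKGROUND": ["Prior art."]}

print(get_section(sections, "SUMMARY"))
print(get_section(sections, "claims"))
```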

In [7]:
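# 5000 documents at a time (chunksize should be an integer)
chunksize = 5000
# note: using read_csv as a context manager requires pandas >= 1.2
with pd.read_csv(description_path, chunksize=chunksize) as reader:
    for chunk in reader:
        print(chunk.head())
        break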

AttributeError: __enter__

In [5]:
description = pd.read_csv(description_path) \
                .drop(columns=["Unnamed: 0", "Language", "PatenType", "PublicationType", "Part", "Number"]) \
                .rename(columns={"Contents": "description"}).rename(str.lower, axis="columns") \
                .set_index("patentnumber")

KeyboardInterrupt: 

In [40]:
title.shape, claims.shape, description.shape

((43080, 2), (43080, 2))