# Data cleaning steps

The steps are illustrated on a single file. In `process.py` we perform roughly the same steps on each file in order to build the final dataframe.

In [1]:
import pandas as pd
from lxml import etree, html

In [2]:
data = pd.read_csv('EP3600000.txt', sep='\t', header=None, names = ["PatenType", "PatentNumber", "PublicationType", "Date", "Language", "Part", "Number", "Contents"])

In [3]:
data.head()

Unnamed: 0,PatenType,PatentNumber,PublicationType,Date,Language,Part,Number,Contents
0,EP,3601685,A1,2020-02-05,de,TITLE,1,VERFAHREN ZUR BESTIMMUNG VON OBJEKTPOSITIONSIN...
1,EP,3601685,A1,2020-02-05,en,TITLE,1,A METHOD FOR DETERMINING OBJECT POSITION INFOR...
2,EP,3601685,A1,2020-02-05,fr,TITLE,1,PROCÉDÉ DE DÉTERMINATION D'INFORMATIONS DE POS...
3,EP,3600528,A1,2020-02-05,de,TITLE,1,BIMODALES GEHÖRSTIMULATIONSSYSTEM
4,EP,3600528,A1,2020-02-05,en,TITLE,1,BIMODAL HEARING STIMULATION SYSTEM


First we remove every document whose language is not English or whose publication type is not *B1*. According to the [EPO](https://www.epo.org/searching-for-patents/helpful-resources/first-time-here/definitions.html), *B1* documents are European patent specifications, i.e. patents that have been granted by the European Patent Office.

In [4]:
data = data.loc[data["Language"] == "en"]
data = data.loc[data["PublicationType"] == "B1"]

In [5]:
data.head()

Unnamed: 0,PatenType,PatentNumber,PublicationType,Date,Language,Part,Number,Contents
151409,EP,3607095,B1,2020-03-25,en,TITLE,1,KIT AND METHOD FOR DETERMINING THE RECEPTIVITY...
151411,EP,3607095,B1,2020-03-25,en,DESCR,1,"<!-- EPO <DP n=""1""> --><heading id=""h0001""><b>..."
151412,EP,3607095,B1,2020-03-25,en,CLAIM,1,"<!-- EPO <DP n=""34""> --><claim id=""c-en-01-000..."
151415,EP,3607095,B1,2020-03-25,en,PDFEP,1,https://data.epo.org/publication-server/pdf-do...
181821,EP,3614486,B1,2020-04-08,en,TITLE,1,RADIO FREQUENCY (RF) CONDUCTIVE MEDIUM


We also remove the rows that only hold a link to the *PDF* of the online publication (those whose `Part` is `PDFEP`).

In [6]:
data = data.loc[data["Part"] != "PDFEP"]

In [7]:
data = data.reset_index(drop=True)
data.head()

Unnamed: 0,PatenType,PatentNumber,PublicationType,Date,Language,Part,Number,Contents
0,EP,3607095,B1,2020-03-25,en,TITLE,1,KIT AND METHOD FOR DETERMINING THE RECEPTIVITY...
1,EP,3607095,B1,2020-03-25,en,DESCR,1,"<!-- EPO <DP n=""1""> --><heading id=""h0001""><b>..."
2,EP,3607095,B1,2020-03-25,en,CLAIM,1,"<!-- EPO <DP n=""34""> --><claim id=""c-en-01-000..."
3,EP,3614486,B1,2020-04-08,en,TITLE,1,RADIO FREQUENCY (RF) CONDUCTIVE MEDIUM
4,EP,3614486,B1,2020-04-08,en,DESCR,1,"<!-- EPO <DP n=""1""> --><heading id=""h0001"">BAC..."
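
The filtering steps up to this point can be folded into one function and applied to each file in turn. The sketch below is only an illustration of that idea: `COLUMNS`, `clean_file` and `build_dataset` are hypothetical names, and the actual `process.py` may be organised differently.

```python
import pandas as pd

# Column names used when reading the raw tab-separated files
# (same names as in the notebook, including the original spelling).
COLUMNS = ["PatenType", "PatentNumber", "PublicationType", "Date",
           "Language", "Part", "Number", "Contents"]


def clean_file(path):
    """Read one raw file and keep only English B1 rows that are not PDF links."""
    data = pd.read_csv(path, sep="\t", header=None, names=COLUMNS)
    keep = (
        (data["Language"] == "en")
        & (data["PublicationType"] == "B1")
        & (data["Part"] != "PDFEP")
    )
    return data.loc[keep].reset_index(drop=True)


def build_dataset(paths):
    """Clean every file and stack the results into a single dataframe."""
    frames = [clean_file(path) for path in paths]
    return pd.concat(frames, ignore_index=True)
```

Combining the three conditions into one boolean mask gives the same rows as applying the filters one after another. Note that `pd.concat` raises an error when `paths` is empty.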


In order to store the data more efficiently, we decided to split the whole dataset into 3 different dataframes:
 * `df_Descr` containing data concerning the description of each patent
 * `df_Titles` containing data concerning the title of each patent
 * `df_Claim` containing data concerning the claims of each patent

We will further preprocess the three dataframes separately, as each one stores its contents in a different data structure.

In [52]:
df_Descr = data[data["Part"] == "DESCR"].reset_index(drop=True)

We noticed that some documents do not contain any description, which is fundamental for generating claims. This is most probably because the patent document has not been digitized and only its PDF version is referenced.

We remove these inconsistencies by collecting all the documents that **do** contain a description and using that list to filter out the *empty* documents.

In [53]:
# get all the patents with a description
descr_patents = df_Descr['PatentNumber'].unique()
descr_df = data[data['PatentNumber'].isin(descr_patents)]

# extract title and claim for those patents
df_Titles = descr_df[descr_df["Part"] == "TITLE"].reset_index(drop=True)
df_Claim = descr_df[descr_df["Part"] == "CLAIM"].reset_index(drop=True)
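
A quick way to check a filter like the one above is to group the remaining rows by patent and confirm that each one still has every expected part. The snippet below is a minimal sketch on toy data: the frame `toy` and the `required` set are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy data: patent 1 has all three parts, patent 2 lacks a description.
toy = pd.DataFrame({
    "PatentNumber": [1, 1, 1, 2, 2],
    "Part": ["TITLE", "DESCR", "CLAIM", "TITLE", "CLAIM"],
})

# Same idea as in the notebook: keep only patents that have a description.
with_descr = toy.loc[toy["Part"] == "DESCR", "PatentNumber"].unique()
filtered = toy[toy["PatentNumber"].isin(with_descr)]

# Collect the set of parts available for each remaining patent.
parts_per_patent = filtered.groupby("PatentNumber")["Part"].agg(set)

# Patents that are still missing one of the required parts.
required = {"TITLE", "DESCR", "CLAIM"}
is_complete = parts_per_patent.map(lambda parts: required <= parts)
missing = parts_per_patent[~is_complete]
```

An empty `missing` series means every remaining patent has a title, a description and at least one claim row.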

In [73]:
df_Titles

Unnamed: 0,PatenType,PatentNumber,PublicationType,Date,Language,Part,Number,Contents
0,EP,3607095,B1,2020-03-25,en,TITLE,1,KIT AND METHOD FOR DETERMINING THE RECEPTIVITY...
1,EP,3614486,B1,2020-04-08,en,TITLE,1,RADIO FREQUENCY (RF) CONDUCTIVE MEDIUM
2,EP,3601957,B1,2020-04-15,en,TITLE,1,SENSOR PACKAGE
3,EP,3603117,B1,2020-04-29,en,TITLE,1,TOOLS FOR ARRANGING VOICE COIL LEADOUTS IN A M...
4,EP,3601435,B1,2020-04-29,en,TITLE,1,LIGHT SCATTERING POLYMERIC COMPOSITION WITH IM...
...,...,...,...,...,...,...,...,...
1191,EP,3618411,B1,2021-02-03,en,TITLE,1,MOBILE TERMINAL AND CONTROLLING METHOD THEREOF
1192,EP,3627901,B1,2021-02-03,en,TITLE,1,MANAGING CONNECTION RETRIES DUE TO ACCESS CLAS...
1193,EP,3656354,B1,2021-02-03,en,TITLE,1,DEVICE FOR ENDOVASCULAR AORTIC REPAIR
1194,EP,3664341,B1,2021-02-03,en,TITLE,1,METHOD FOR TRANSMITTING ACK/NACK IN WIRELESS C...


In [56]:
df_Descr.head()

Unnamed: 0,PatenType,PatentNumber,PublicationType,Date,Language,Part,Number,Contents
0,EP,3607095,B1,2020-03-25,en,DESCR,1,"<!-- EPO <DP n=""1""> --><heading id=""h0001""><b>..."
1,EP,3614486,B1,2020-04-08,en,DESCR,1,"<!-- EPO <DP n=""1""> --><heading id=""h0001"">BAC..."
2,EP,3601957,B1,2020-04-15,en,DESCR,1,"<!-- EPO <DP n=""1""> --><heading id=""h0001""><u>..."
3,EP,3603117,B1,2020-04-29,en,DESCR,1,"<!-- EPO <DP n=""1""> --><heading id=""h0001""><b>..."
4,EP,3601435,B1,2020-04-29,en,DESCR,1,"<!-- EPO <DP n=""1""> --><heading id=""h0001""><b>..."


In [57]:
df_Claim.head()

Unnamed: 0,PatenType,PatentNumber,PublicationType,Date,Language,Part,Number,Contents
0,EP,3607095,B1,2020-03-25,en,CLAIM,1,"<!-- EPO <DP n=""34""> --><claim id=""c-en-01-000..."
1,EP,3614486,B1,2020-04-08,en,CLAIM,1,"<!-- EPO <DP n=""8""> --><claim id=""c-en-01-0001..."
2,EP,3601957,B1,2020-04-15,en,CLAIM,1,"<!-- EPO <DP n=""31""> --><claim id=""c-en-01-000..."
3,EP,3603117,B1,2020-04-29,en,CLAIM,1,"<!-- EPO <DP n=""13""> --><claim id=""c-en-01-000..."
4,EP,3601435,B1,2020-04-29,en,CLAIM,1,"<!-- EPO <DP n=""29""> --><claim id=""c-en-01-000..."
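
Since the three dataframes share the `PatentNumber` column, they can be recombined later into one row per patent. The following is a hedged sketch on toy frames (`titles`, `descr`, `claims` and the renamed columns are hypothetical), assuming each patent has exactly one row per part.

```python
import pandas as pd

# Toy stand-ins for df_Titles, df_Descr and df_Claim.
titles = pd.DataFrame({"PatentNumber": [1, 2], "Contents": ["Title A", "Title B"]})
descr = pd.DataFrame({"PatentNumber": [1, 2], "Contents": ["Descr A", "Descr B"]})
claims = pd.DataFrame({"PatentNumber": [1, 2], "Contents": ["Claims A", "Claims B"]})

# Rename the shared "Contents" column so the merged columns do not clash,
# then join the three frames on the patent number.
merged = (
    titles.rename(columns={"Contents": "Title"})
    .merge(descr.rename(columns={"Contents": "Description"}), on="PatentNumber", how="inner")
    .merge(claims.rename(columns={"Contents": "Claims"}), on="PatentNumber", how="inner")
)
```

An inner join drops any patent that is missing from one of the frames, which matches the earlier decision to discard documents without a description.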


## Description data preprocessing

Descriptions are stored as XML documents, which sometimes contain mixed HTML tags and unclosed tags.
Each patent description is generally divided into sections. Each section is identified by a heading and has a body composed of an arbitrary number of elements.

For instance one description might look like:
```
<heading>Heading 1</heading>
<p>...</p>
<p>...</p>

.
.
.

<heading>Heading N</heading>
.
.
.
```

We parse each description into a dictionary (where the key is the section name and the value is the section's content) so that we can later refer to different sections for different uses.
Whenever a document contains plain text without any section structure, we store it under the section name `Heandingless description`, to maintain consistency with the other documents.

In [59]:
def parse_desc(xml):
    # parse the description as HTML, which tolerates mixed and unclosed tags
    description = html.fromstring(xml)

    sections = dict()
    # flag which top-level elements are headings
    headings = [e.tag == "heading" for e in description]
    if True in headings:
        # drop every leading element that is not a heading
        # (mostly comments and document specifications)
        description = description[headings.index(True):]

        # loop through the remaining elements of the description
        cur_section = None
        for element in description:
            if element.tag == "heading":
                # a heading starts a new section
                cur_section = element.text_content()
                sections[cur_section] = list()
            elif element.tag is not etree.Comment:
                # any element other than a comment belongs to the current section
                sections[cur_section].append(element.text_content())
    else:
        # no heading found: store the whole text under a constant heading
        sections['Heandingless description'] = description.text_content()

    return sections

# parse all the documents
df_Descr["Contents"] = df_Descr["Contents"].apply(parse_desc)

In [67]:
df_Descr.head()

Unnamed: 0,PatenType,PatentNumber,PublicationType,Date,Language,Part,Number,Contents
0,EP,3607095,B1,2020-03-25,en,DESCR,1,{'FIELD OF THE INVENTION': ['The present inven...
1,EP,3614486,B1,2020-04-08,en,DESCR,1,{'BACKGROUND': ['Electromagnetic waves or elec...
2,EP,3601957,B1,2020-04-15,en,DESCR,1,{'Field of the Invention': ['The present inven...
3,EP,3603117,B1,2020-04-29,en,DESCR,1,{'RELATED APPLICATION': ['This application cla...
4,EP,3601435,B1,2020-04-29,en,DESCR,1,{'Field of the invention': ['The present inven...
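
As the output above shows, the same section can appear with different capitalisation across patents (for example `FIELD OF THE INVENTION` and `Field of the invention`). A small lookup helper can hide that difference. `get_section` is a hypothetical helper written for illustration and is not part of the original code.

```python
def get_section(sections, name):
    """Return the text of a section, matching the heading case-insensitively.

    `sections` is a dictionary as produced by the parsing step: each value is
    either a list of paragraph strings or, for documents without headings,
    a single string. Returns None when no heading matches.
    """
    wanted = name.strip().lower()
    for heading, content in sections.items():
        if heading.strip().lower() == wanted:
            if isinstance(content, str):
                return content
            # join the paragraphs of the section into one block of text
            return "\n".join(content)
    return None
```

For instance, `get_section(row_sections, "field of the invention")` would match both spellings shown above, where `row_sections` stands for one value of the `Contents` column.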


## Claim data preprocessing

Claims are also stored as XML documents.
Each patent presents several claims, identified by the `claim` tag (i.e. `<claim>...</claim>`).
We can therefore easily extract the textual content of each claim, obtaining a list of claims for each document.

In [65]:
def parse_claim(xml):
    tree = html.fromstring(xml)
    claims = tree.xpath("//claim")
    claims = [c.text_content() for c in claims]
    return claims

df_Claim["Contents"] = df_Claim["Contents"].apply(parse_claim)

In [66]:
df_Claim.head()

Unnamed: 0,PatenType,PatentNumber,PublicationType,Date,Language,Part,Number,Contents
0,EP,3607095,B1,2020-03-25,en,CLAIM,1,[A kit for determining the receptivity status ...
1,EP,3614486,B1,2020-04-08,en,CLAIM,1,[A radio frequency (RF) device comprising:a st...
2,EP,3601957,B1,2020-04-15,en,CLAIM,1,"[A sensor package, comprisinga sensor chip (3)..."
3,EP,3603117,B1,2020-04-29,en,CLAIM,1,[A tool for arranging voice coil leadouts in a...
4,EP,3601435,B1,2020-04-29,en,CLAIM,1,"[A polymeric composition, comprising:from 90.0..."
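
In the output above, text from nested elements is concatenated without a separator (for example `comprising:a st...`). One possible variation, which is not what the notebook does, is to join the text nodes with a space and normalise whitespace. `parse_claim_spaced` is a hypothetical name, and the toy input in the usage note is an assumption about how claims are nested.

```python
from lxml import html


def parse_claim_spaced(xml):
    """Extract the text of each <claim>, separating nested text nodes by a space."""
    tree = html.fromstring(xml)
    claims = []
    for claim in tree.xpath("//claim"):
        # itertext() yields every text node inside the claim, in document order
        text = " ".join(claim.itertext())
        # collapse runs of whitespace into single spaces
        claims.append(" ".join(text.split()))
    return claims
```

Applied to `'<claim><claim-text>A device comprising:<claim-text>a sensor</claim-text></claim-text></claim>'`, this keeps a space between `comprising:` and `a sensor`. The trade-off is that inline markup inside a word, such as a subscript, would also be split by a space.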
