In this notebook, all samples that contain undefined or corrupted F-Terms are removed from the raw dataset. An F-Term is undefined if it is missing from the previously extracted F-Term definitions, and corrupted if its viewpoint or number is missing.

The cleaned dataset is then saved sample by sample in data/dataset_samples.
Each sample is a txt file containing the abstract and the F-Terms, separated by a special token.
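
The sample layout described above can be sketched as follows. This is a minimal illustration, assuming that the F-Terms follow the special token as a comma-separated list with a trailing comma (which is what the `split(',')[:-1]` calls later in this notebook suggest). The helper `split_sample` and the example strings are hypothetical.

```python
# Assumption: a sample looks like "<abstract><START F-TERMS><f_term_1>,<f_term_2>,"
# with a trailing comma after the last F-Term.
START_F_TERM_TOKEN = '<START F-TERMS>'


def split_sample(sample: str, token: str = START_F_TERM_TOKEN):
    """Split one sample into its abstract and its list of F-Terms."""
    abstract, _, f_term_string = sample.partition(token)
    # Drop the empty string that follows the trailing comma.
    f_terms = [t for t in f_term_string.split(',') if t]
    return abstract, f_terms


# Hypothetical sample: the abstract and the F-Term strings are made up.
example = 'PURPOSE:To improve something.<START F-TERMS>2B001/AA01,2B001/AB02,'
abstract, f_terms = split_sample(example)
print(abstract)
print(f_terms)
```

Using `str.partition` keeps the abstract intact even if the token were missing, in which case the F-Term list is simply empty.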

# Imports

In [6]:
# Own Packages
from Masterarbeit_utils import dataset_utils, model_utils
from Masterarbeit_utils.dataset_utils import load_parquet_to_dask, LabelEmbedding

# Site-Packages
import torch 
from torch.utils.data import Dataset
import dask
from dask import dataframe as dd
import pandas as pd
import numpy as np
import re
import csv
import os
import pickle as pk
import cProfile
import random



In [7]:
choices = ['calculate all', 'ask for userinput', 'just calculate needed']
calculation_profile =  choices[2]
calculation_profile

'just calculate needed'

# Parameters

In [8]:
# Path to the Parquet file of the raw dataset
raw_dataset_path = 'data/JPO_patents_abstracts_fterms'

# Path to the directory in which pickle-files are dumped
dump_dir = r'PK_DUMP'
# os.mkdirs does not exist; os.makedirs creates the directory if it is missing
os.makedirs(dump_dir, exist_ok=True)

# Path where the cleaned data is saved. 
output_path_drop_undefined = r'data/drop_undefined'
# Path where the cleaned and defined samples will be stored as individual files.
files_path = "data/dataset_samples"

# Start f-term Token (special token which shows the model that now the prediction of f-terms begins)
start_f_term_token = '<START F-TERMS>'

# Minimum number of occurrences each F-Term is upsampled to
upsample_to = 5
# Validation split fraction (0.01 = 1 % of the samples)
train_val_split = 0.01
# If a patent has at most drop_on_single F-Terms and also contains an F-Term that 
# occurs just once in the dataset, it is dropped.
drop_on_single = 5
# Maximum number of tokens allowed per sample
max_total_tokens = 370
# Maximum number of F-Terms allowed per patent
max_f_terms = 50


# Loading the Raw Dataset and Dropping Lines with Corrupted F-Terms

In [9]:
# Cleaning the Dataset.
data = dataset_utils.clean_df(raw_dataset_path)
data.head()

Unnamed: 0,Sample
0,PURPOSE:A process for producing moisture absor...
1,PURPOSE:To improve transmission characteristic...
2,PURPOSE:To enhance a breakwater function resul...
3,PURPOSE:To enhance the cooling efficiency and ...
4,PURPOSE:To prevent the leakage of fluid by a m...


# Extracting the unique f-terms

### This is done to extract a list of F-Terms which should be added as tokens to the tokenizer
### To recalculate the unique f-terms, the f_terms_in_ds_dir.pk file must first be deleted from the PK_DUMP folder

In [10]:
# Run this cell only on your first run of this notebook. It takes a long time.
# All outputs will be saved and can be loaded from disk.

# In this dict each unique F-Term is stored together with the index of the first sample it occurs in

if not os.path.isfile(f'{dump_dir}/f_terms_in_ds_dir.pk'):
    f_term_dict = {}
    def extract_f_terms(line):
        return line.split(start_f_term_token)[1]
    
    raw_f_terms = data['Sample'].apply(extract_f_terms, meta=('extract_f_terms', 'str'))
    for i, row in enumerate(raw_f_terms):
        
        if len(row) <9:
            continue
        f_terms = row.split(',')[:-1]
    
        for f_term in f_terms:
            # Remember the index of the first sample in which the F-Term occurs
            if f_term not in f_term_dict:
                f_term_dict[f_term] = i
        
        if i%10000 == 0: 
            print(f'Processed {i} Samples, found {len(f_term_dict)} unique F-terms', end='\r')
            
    
    # Saving the dict
    with open(f'{dump_dir}/f_terms_in_ds_dir.pk', 'wb') as f:
        pk.dump(f_term_dict, f)


The dataset contains about 450000 unique F-Terms, although only about 360000 F-Terms are expected. The surplus arises because many F-Terms in the dataset are corrupted, i.e. they lack parts such as the viewpoint or the number.
There are also many F-Terms that are no longer in use.
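
A well-formedness check for such corrupted F-Terms could look like the sketch below. The pattern is an assumption and is not taken from this notebook or from `dataset_utils.clean_df`: it expects a theme code (one digit, one letter, three digits), an optional `/`, a two-letter viewpoint and a two-digit number. The name `is_well_formed` and the example strings are hypothetical.

```python
import re

# Assumed layout (not confirmed by this notebook): theme code, optional '/',
# two-letter viewpoint, two-digit number, e.g. '4C083/AA01'.
F_TERM_PATTERN = re.compile(r'^\d[A-Z]\d{3}/?[A-Z]{2}\d{2}$')


def is_well_formed(f_term: str) -> bool:
    """Return True if the string matches the assumed F-Term layout."""
    return bool(F_TERM_PATTERN.match(f_term))


# Made-up candidates: one complete F-Term, one without viewpoint,
# one without number and one that consists of the theme code only.
candidates = ['4C083/AA01', '4C083/01', '4C083/AA', '4C083']
corrupted = [c for c in candidates if not is_well_formed(c)]
print(corrupted)
```

If the real data uses a different separator or longer numbers, only the pattern needs to change.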

# Loading the outputs

In [11]:
# Loading the dict containing all unique f-terms in the dataset

with open(f'{dump_dir}/f_terms_in_ds_dir.pk', 'rb') as f:
    f_term_dict = pk.load(f)

In [12]:
# Loading a dict that contains all unique F-Terms with crawled definitions.
# This dict is created when the Dataset Analysis notebook is run

with open(f'{dump_dir}/f_term_dict.pk', 'rb') as f:
    definitions = pk.load(f)

## Creating a List With All F-Terms That Are Not Defined by the Definitions Dict

In [13]:
# Checking for which F-Terms from the dataset a definition is present
exceptions = {}
exceptions_l = 0
for i, key in enumerate(f_term_dict.keys()):
    try: 
        _ = definitions[key]
        exceptions[key] = 0
    except KeyError:
        exceptions[key] = 1
        exceptions_l += 1
print(f'There are {exceptions_l} F-Terms in the dataset, which have no definition in the definitions dict!')

There are 74205 F-Terms in the dataset, which have no definition in the definitions dict!
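
The same check can be expressed with a set difference. The sketch below uses small made-up dictionaries (`f_term_dict_toy`, `definitions_toy`) in place of the pickled ones, so the names and keys are assumptions for illustration only.

```python
# Toy stand-ins for f_term_dict (F-Terms found in the dataset) and
# definitions (F-Terms with crawled definitions); all keys are made up.
f_term_dict_toy = {'2B001/AA01': 0, '2B001/AB02': 0, '2B001/ZZ99': 3}
definitions_toy = {'2B001/AA01': 'definition a', '2B001/AB02': 'definition b'}

# Iterating a dict yields its keys, so set(d) is the set of its keys.
undefined = set(f_term_dict_toy) - set(definitions_toy)

# Same encoding as the exceptions dict above: 1 = undefined, 0 = defined.
exceptions_toy = {key: int(key in undefined) for key in f_term_dict_toy}
print(len(undefined), exceptions_toy)
```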


# Dropping all Patents From the Dataset, Which Contain Undefined F-Terms.

In [14]:
def extract_f_terms(line, Stoken : str='<START F-TERMS>'):
    """
    This function should be applied to a dask dataframe with already cleaned data from the dataset_utils.clean_and_save_df function.
    
    This function extracts all F-Terms from a sample string and returns them as a list.
    """
    line_data = line
    f_term_string = line_data.split(Stoken)[-1]
    f_terms = f_term_string.split(',')[:-1]
    return f_terms
    

def drop_undefined(clean_data: dd.DataFrame, exceptions_dict: dict):
    """
    This function should be applied to a dask dataframe which was cleaned by the clean_and_save_df function.
    It drops each row that contains undefined F-Terms.
    """
    def test_f_terms(line, exceptions_dict: dict=exceptions_dict):
        """
        This function should be applied to the extracted f-term column of a df.
        This function will return True if all F-Terms are defined and will return False if any F-Term in a row is undefined
        """
        res = [exceptions_dict[l] for l in line]
        res_sum = sum(res)
        return not bool(res_sum)
    
    clean_data['F_Terms'] = clean_data['Sample'].apply(extract_f_terms, meta=('F_Terms', 'str'))
    clean_data['F_Terms'] = clean_data['F_Terms'].apply(test_f_terms, meta=('F_Terms', 'bool'))
    defined_data = clean_data[['Sample']].loc[clean_data['F_Terms']]
    return defined_data
    



In [15]:
defined_data = drop_undefined(data, exceptions)
defined_data.head()

Unnamed: 0,Sample
0,PURPOSE:A process for producing moisture absor...
1,PURPOSE:To improve transmission characteristic...
2,PURPOSE:To enhance a breakwater function resul...
3,PURPOSE:To enhance the cooling efficiency and ...
6,PURPOSE:To improve high speed running characte...


# Dropping Samples that Contain Too Many Tokens or F-Terms

In [16]:
tokenizer = model_utils.get_tokenizer(dump_dir)

def get_n_tokens(sample):
    """
    This function returns True if the sample has fewer than the maximum allowed number of tokens and 
    False otherwise.
    """
    n_tokens = len(tokenizer(sample)['input_ids'])
    return n_tokens < max_total_tokens

def get_n_f_terms(sample):
    """
    This function returns True if the sample has fewer than the maximum allowed number of F-Terms and False otherwise.
    """
    n_f_terms = len(extract_f_terms(sample))
    return n_f_terms < max_f_terms

defined_data['n_tokens'] = defined_data['Sample'].map(get_n_tokens, meta=('n_tokens', 'bool'))
defined_data['n_f_terms'] = defined_data['Sample'].map(get_n_f_terms, meta=('n_f_terms', 'bool'))
# Keep a sample only if both conditions hold (logical "and")
defined_data['keep'] = defined_data['n_tokens'] & defined_data['n_f_terms']
defined_data = defined_data[['Sample']].loc[defined_data['keep']]

In [17]:
defined_data = load_parquet_to_dask('data/tmp/defined_data')

In [18]:
len(defined_data)

7478671

# Dropping all Samples with Low Occurrence

All samples that contain an F-Term which occurs only once in the dataset are dropped. Such an F-Term would end up either only in the train dataset or only in the validation dataset, and neither scenario is ideal.
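
Finding the F-Terms that occur only once amounts to counting. The notebook uses the `LabelEmbedding` class from `Masterarbeit_utils` for this. The sketch below uses `collections.Counter` as a stand-in, since the interface of `LabelEmbedding` is not shown here. The toy samples and the helper `extract` are made up.

```python
from collections import Counter

TOKEN = '<START F-TERMS>'

# Made-up samples in the format used by this notebook.
samples = [
    f'abstract one{TOKEN}A,B,',
    f'abstract two{TOKEN}A,C,',
    f'abstract three{TOKEN}A,B,',
]


def extract(sample: str) -> list:
    """Return the F-Terms that follow the start token."""
    return sample.split(TOKEN)[-1].split(',')[:-1]


# Count how often each F-Term occurs over all samples.
counts = Counter(f_term for sample in samples for f_term in extract(sample))
single = sorted(f_term for f_term, n in counts.items() if n == 1)
print(counts, single)
```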

# Counting the Occurrence of each F-Term in the dataset

In [19]:
# Run this cell just once it takes really long and all the outputs are also saved on disk
# after the first run.
i = input(
"""Warning: this cell takes a long time,
after the first run of this cell, all outputs are already saved on disk and don't need to
be recalculated. 
If you want to proceed write 'y'""")
if i == 'y':

    # LabelEmbedding Instance (see Masterarbeit_utils.dataset_utils)
    label_embedding = LabelEmbedding()
    for i, sample in enumerate(defined_data['Sample']):
        f_terms = extract_f_terms(sample)
        [label_embedding(f_term) for f_term in f_terms]
        if i%1000 == 0:
            print(f'Processed {i} samples!', end='\r')
    
    # Saving the Label Embeddings
    with open(f'{dump_dir}/label_embedding_defined_data.pk', 'wb') as f:
        pk.dump(label_embedding, f)



after the first run of this cell, all outputs are already saved on disk and don't need to
be recalculated. 
If you want to proceed write 'y' n


In [22]:
def drop_single_fterms(data: dd.DataFrame, label_embedding: LabelEmbedding = None):
    """
    This function should be applied to the dask DataFrame after the patents that contain 
    undefined F-Terms were dropped.
    It removes every F-Term that occurs just once in the whole dataset from the samples.
    A patent that contained such an F-Term is dropped if at most drop_on_single F-Terms 
    remain afterwards. Patents with more remaining F-Terms are kept without the removed 
    F-Term, because it will not have a high influence on the meaning of the patent.

    :data: dataframe cleaned by drop_undefined function
    :label_embedding: LabelEmbedding instance used on the defined-data dataframe.
                          It is used to look up the occurrence of an F-Term.
    """
    if label_embedding is None:
        # loading the label embeddings
        with open(f'{dump_dir}/label_embedding_defined_data.pk', 'rb') as f:
            label_embedding = pk.load(f)
    # Setting all label occurrences larger than one to zero,
    # so that only F-Terms that occur exactly once keep a non-zero value
    label_embedding.occurrence = np.array(label_embedding.occurrence)
    label_embedding.occurrence[label_embedding.occurrence > 1] = 0

    def check_single(f_terms: list, label_embedding: LabelEmbedding=label_embedding):
        """
        This function checks if a sample contains an F-Term that occurs just once in the
        whole dataset.
        It returns True if such an F-Term is found and returns False if all F-Terms occur 
        more often than that.
        :f_terms:  f_terms extracted by the previously defined extract_f_terms function. 
        :label_embedding: LabelEmbedding instance used on the defined-data dataframe.
                          It is used to look up the occurrence of an F-Term.
        """
        f_term_ind = [label_embedding.dict[f_term] for f_term in f_terms]
        f_term_occ = [label_embedding.occurrence[ind] for ind in f_term_ind]
        # If all F-Terms occur more than once, the sum is 0. If any F-Term occurs just once,
        # the sum is larger than zero and thus True
        res = sum(f_term_occ)
        return bool(res)

    def check_len(f_terms: list):
        """
        This function checks if a patent contains more F-Terms than the drop_on_single threshold.
        At or below this threshold, the patent is removed if it also contains an F-Term with an
        occurrence of 1 in the dataset.

        :f_terms: f_terms extracted by the previously defined extract_f_terms function.
        """
        l = len(f_terms)
        if l > drop_on_single:
            # Patent will not be dropped
            return False
        else:
            # Patent may be dropped if also a low occurring F-Term is found
            return True

    def should_drop(single_f_terms: bool, short_len:bool):
        """
        Returns False if a patent contains an F-Term that occurs just once in the dataset and 
        also only a small number of F-Terms are present in the sample.
        False = Sample will be dropped
        True = Sample will be kept
        """
        # Patent will be dropped if both conditions are True
        return not all([single_f_terms, short_len])

    def remove_f_terms(sample: str,
                       label_embedding: LabelEmbedding = label_embedding):
        """
        This function removes all F-Terms that occur just once in the dataset from a sample.
        Whether the shortened sample is kept is decided afterwards with the "drop_on_single" threshold.
        
        :sample: sample from the dataset with abstract, start_f_term_token and f_terms
        :label_embedding: LabelEmbedding instance used on the defined-data dataframe.
                          It is used to look up the occurrence of an F-Term.
        """
        f_terms = extract_f_terms(sample)
        # Embedding the F-Terms to labels
        f_terms_nums = [label_embedding.dict[f_term] for f_term in f_terms]
        # Keeping only the F-Terms that occur exactly once
        f_terms_nums_low_occ = [n for n in f_terms_nums if label_embedding.occurrence[n] == 1]
        # Converting the embedded F-Terms back to strings
        f_terms_low_occ = [label_embedding.r_dict[n] for n in f_terms_nums_low_occ]
        # removing the f_terms_from the sample
        for f_term in f_terms_low_occ:
            sample = sample.replace(f'{f_term},', '')
        return sample        

    data['F_Terms'] = data['Sample'].apply(extract_f_terms, meta=('F_Terms', 'str'))
    # checking which patent contains low occurring F-Terms
    data['Single'] = data['F_Terms'].apply(check_single, meta=('Single', 'bool'))
    # removing low occurring F-Terms
    data['Sample'] = data['Sample'].apply(remove_f_terms, meta=('Sample', 'str'))
    # Re-extract the F-Terms to check whether the sample dropped below the threshold number of F-Terms
    data['F_Terms'] = data['Sample'].apply(extract_f_terms, meta=('F_Terms', 'str'))
    data['Len'] = data['F_Terms'].apply(check_len, meta=('Len', 'bool'))
    # Despite its name, 'Drop' is the mask of patents to keep: a patent is removed only if it
    # contained a single-occurrence F-Term and is short (logical "and", then negated)
    data['Drop'] = ~(data['Single'] & data['Len'])
    no_single_data = data[['Sample']].loc[data['Drop']]
    return no_single_data
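
The decision rule implemented by `drop_single_fterms` can be illustrated on a single sample. The sketch below is a plain-Python restatement under the assumption that occurrences are available as a dictionary. The helper `keep_sample`, the `counts` dictionary and the F-Term names are hypothetical.

```python
def keep_sample(f_terms, counts, drop_on_single=5):
    """Return (keep, remaining_f_terms) for one sample.

    had_single is evaluated before the single-occurrence F-Terms are removed,
    the length check is done on the F-Terms that remain afterwards.
    """
    had_single = any(counts[t] == 1 for t in f_terms)
    remaining = [t for t in f_terms if counts[t] > 1]
    is_short = len(remaining) <= drop_on_single
    return not (had_single and is_short), remaining


# Made-up occurrence counts; 'C' occurs only once.
counts = {'A': 3, 'B': 2, 'C': 1}

no_single = keep_sample(['A', 'B'], counts, drop_on_single=1)
short_with_single = keep_sample(['A', 'C'], counts, drop_on_single=1)
long_with_single = keep_sample(['A', 'B', 'C'], counts, drop_on_single=1)
print(no_single, short_with_single, long_with_single)
```

A sample without single-occurrence F-Terms is always kept, regardless of its length.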
    

In [21]:
def iteratively_drop_samples(data: dd.DataFrame):
    """
    This function iteratively applies the drop_single_fterms function to the dataset until
    no single f-terms remain in the dataset
    """
    # placeholders
    label_embedding = None
    n_single_fterms = 1
    t = 1
    while n_single_fterms !=0:
        
        data = drop_single_fterms(data, label_embedding)
        # dumping the data in a tmp dir 
        os.makedirs('data/tmp', exist_ok=True)
        data.to_parquet('data/tmp/dataset')
        data = load_parquet_to_dask('data/tmp/dataset')
        # LabelEmbedding Instance (see Masterarbeit_utils.dataset_utils)
        label_embedding = LabelEmbedding()
        for i, sample in enumerate(data['Sample']):
            f_terms = extract_f_terms(sample)
            [label_embedding(f_term) for f_term in f_terms]
            if i%1000 == 0:
                print(f'Processed {i} samples!', end='\r')
        n_single_fterms = len([i for i in label_embedding.occurrence if i ==1])
        print(f'Iteration: {t}, remaining single F-Terms: {n_single_fterms}')
        t += 1
    return data

In [35]:
i = input(
"""Warning: this cell takes a long time,
after the first run of this cell, all outputs are already saved on disk and don't need to
be recalculated. 
If you want to proceed write 'y'""")
if i == 'y':
    pd.set_option('display.max_colwidth', 50)
    no_single_data = iteratively_drop_samples(defined_data)


after the first run of this cell, all outputs are already saved on disk and don't need to
be recalculated. 
If you want to proceed write 'y' y


Iteration: 1, remaining single F-Terms: 0
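
Repeating the step is necessary in general, because dropping a sample lowers the occurrence of its other F-Terms, which can create new single-occurrence F-Terms. The sketch below shows this cascade on in-memory lists. It is not the dask implementation above: it counts first and prunes afterwards, and the function `prune_until_stable` and the data are made up.

```python
from collections import Counter


def prune_until_stable(samples, drop_on_single=5):
    """Repeat the pruning step until no F-Term occurs exactly once.

    samples is a list of F-Term lists. Returns the pruned samples and the
    number of pruning iterations that were needed.
    """
    iteration = 0
    while True:
        counts = Counter(t for f_terms in samples for t in f_terms)
        if not any(n == 1 for n in counts.values()):
            return samples, iteration
        iteration += 1
        pruned = []
        for f_terms in samples:
            had_single = any(counts[t] == 1 for t in f_terms)
            remaining = [t for t in f_terms if counts[t] > 1]
            # Drop short samples that contained a single-occurrence F-Term.
            if had_single and len(remaining) <= drop_on_single:
                continue
            pruned.append(remaining)
        samples = pruned


# Dropping ['D', 'X'] in the first pass turns 'D' into a single F-Term.
toy = [['A', 'B', 'C'], ['A', 'B'], ['D', 'X'], ['D', 'A']]
result, iterations = prune_until_stable(toy, drop_on_single=1)
print(result, iterations)
```

The loop terminates because every pass that finds a single-occurrence F-Term removes at least one F-Term occurrence.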


In [16]:
# loading the saved dataset
no_single_data = load_parquet_to_dask('data/tmp/dataset')

# Saving the Dask DataFrame as Individual Text Files
### Todo:
- Upsampling low-occurring F-Terms? Maybe later!
- Making sure that each F-Term occurs in the train and validation set. -- Making sure each F-Term occurs in the train dataset.

In [37]:
i = input(
"""Warning: this cell takes a long time,
after the first run of this cell, all outputs are already saved on disk and don't need to
be recalculated. 
If you want to proceed write 'y'""")

if i == 'y':
    label_embedding = LabelEmbedding()
    for i, sample in enumerate(no_single_data['Sample']):
        f_terms = extract_f_terms(sample)
        [label_embedding(f_term) for f_term in f_terms]
        if i%1000 == 0:
            print(f'Processed {i} samples!', end='\r')
            
    # Saving the Label Embeddings
    with open(f'{dump_dir}/label_embedding_no_single.pk', 'wb') as f:
        pk.dump(label_embedding, f)

after the first run of this cell, all outputs are already saved on disk and don't need to
be recalculated. 
If you want to proceed write 'y' y


Processed 7478000 samples!

In [17]:
# Converting the samples to a list to shuffle them
no_single_data = [row for row in no_single_data['Sample']]
# The dataset is converted to an in-memory list here so that random.shuffle can be used.
# This only works on machines with a lot of RAM.
random.shuffle(no_single_data)
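
Because the shuffle determines which samples end up in the validation set, a fixed seed makes the split reproducible. The sketch below shows one way to do this with a dedicated `random.Random` instance. The seed value and the toy rows are assumptions and are not part of the original run.

```python
import random

# Made-up rows that stand in for the sample strings.
rows = [f'sample {i}' for i in range(10)]

# Hypothetical seed; any fixed integer gives a repeatable order.
seed = 42

shuffled_a = rows.copy()
random.Random(seed).shuffle(shuffled_a)

shuffled_b = rows.copy()
random.Random(seed).shuffle(shuffled_b)

# Both shuffles use the same seed and therefore yield the same order.
print(shuffled_a == shuffled_b)
```

A separate `random.Random` instance leaves the global random state untouched.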

In [18]:
# This cell takes really long (90 min)

i = input(
"""Warning: this cell takes a long time,
after the first run of this cell, all outputs are already saved on disk and don't need to
be recalculated. 
If you want to proceed write 'y'""")

if i == 'y':
    # creating the train and validation folders
    os.makedirs(f'{files_path}/train', exist_ok=True)
    os.makedirs(f'{files_path}/validation', exist_ok=True)
    
    # Loading the label embeddings for the dataset without single F-Terms.
    # These will be needed to calculate the sampling factors:
    # each F-Term must occur at least 5 times in the dataset, so some samples will be 
    # added multiple times to the dataset
    with open(f'{dump_dir}/label_embedding_no_single.pk', 'rb') as f:
        label_embedding = pk.load(f)
    # During separation of the dataset, the occurrences of the F-Terms are tracked in the 
    # train and validation datasets. The decision in which dataset a sample is put 
    # is made according to these occurrences and the train_val_split parameter
    train_occurrences = {key: 0 for key in label_embedding.dict.keys()}
    validation_occurrences = {key: 0 for key in label_embedding.dict.keys()}
    # Metrics to monitor the progress
    n_train = 0
    n_validation = 0
    # target_occurrences 
    targ_train_occ = {key: value*(1-train_val_split) for key, value in label_embedding.dict.items()}
    targ_val_occ = {key: value*(train_val_split) for key, value in label_embedding.dict.items()}
    
    
    for i, sample in enumerate(no_single_data):
        f_terms = extract_f_terms(sample)
        train_occ = np.array([train_occurrences[f_term] for f_term in f_terms])
        #val_occ = np.array([validation_occurrences[f_term] for f_term in f_terms])

        # Each F-Term has to be at least once in the training dataset.
        if 0 in train_occ:
            with open(f'{files_path}/train/{n_train}.txt', 'w', encoding="utf-8") as f:
                f.write(sample)

            for f_term in f_terms:
                train_occurrences[f_term] += 1
            n_train += 1

        else:
            score = n_validation / (n_train+n_validation)
    
            # Validation dataset
            if score < train_val_split:
                # There are more samples in the train-dataset than wanted, 
                # so the sample will be put into the val-dataset
                with open(f'{files_path}/validation/{n_validation}.txt', 'w', encoding="utf-8") as f:
                    f.write(sample)
        
                for f_term in f_terms:
                    validation_occurrences[f_term] += 1
                n_validation += 1
    
            # Training dataset
            else:
                # There are more samples in the validation dataset than wanted, 
                # so the sample will be put into the train-dataset!
                with open(f'{files_path}/train/{n_train}.txt', 'w', encoding="utf-8") as f:
                    f.write(sample)
    
                for f_term in f_terms:
                    train_occurrences[f_term] += 1
                n_train += 1
                  
        # Printing progress
        if i%1000 == 0:
            train_occ_all = np.array(list(train_occurrences.values()))
            val_occ_all = np.array(list(validation_occurrences.values()))
            labels_not_in_train = len(train_occ_all[train_occ_all==0])
            train_occ_all = train_occ_all[val_occ_all !=0]
            val_occ_all = val_occ_all[val_occ_all !=0]
            label_split = np.mean(val_occ_all/(val_occ_all + train_occ_all))
            print(f'''Train-Samples: {n_train:.0f} Val-Samples: {n_validation:.0f} set-split: {train_val_split} current-split: {n_validation/(n_train+n_validation):.4f} label-split: {label_split:.4f} labels not in train: {labels_not_in_train} samples: {i}             ''', end='\r')
    

after the first run of this cell, all outputs are already saved on disk and don't need to
be recalculated. 
If you want to proceed write 'y' y




Train-Samples: 7403221 Val-Samples: 74780 set-split: 0.01 current-split: 0.0100 label-split: 0.0168 labels not in train: 0 samples: 7478000                 
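
The assignment rule used in the cell above can be sketched without file I/O. A sample goes to the train set if it carries an F-Term that the train set has not seen yet. Otherwise it goes to the validation set as long as the current validation share is below the target. The function `split_samples` and the toy data are made up, and unlike the cell above the sketch guards against a division by zero when no sample has been assigned yet.

```python
def split_samples(samples, val_fraction=0.01):
    """Greedy train/validation split over lists of F-Terms."""
    train, validation = [], []
    train_occurrences = {}
    for f_terms in samples:
        # Every F-Term has to occur at least once in the train set.
        unseen = any(train_occurrences.get(t, 0) == 0 for t in f_terms)
        total = len(train) + len(validation)
        val_share = len(validation) / total if total else 0.0
        if unseen or val_share >= val_fraction:
            train.append(f_terms)
            for t in f_terms:
                train_occurrences[t] = train_occurrences.get(t, 0) + 1
        else:
            validation.append(f_terms)
    return train, validation


# Made-up samples and an exaggerated split for illustration.
toy = [['A', 'B'], ['A'], ['B'], ['A', 'B'], ['C']]
train, validation = split_samples(toy, val_fraction=0.5)
print(train, validation)
```

By construction, every F-Term in the validation set also occurs in the train set, which matches the "labels not in train: 0" figure reported above.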

In [41]:
with open(f'{dump_dir}/label_embedding_no_single.pk', 'rb') as f:
        label_embedding = pk.load(f)
occ = np.array(label_embedding.occurrence)
occ[occ>1] = 0
print(f'Number of F-Terms with occurrence 1 in dataset: {sum(occ)}')

Number of F-Terms with occurrence 1 in dataset: 0


In [5]:
import os
x = os.listdir('data/dataset_samples/train')
x_2 = os.listdir('data/dataset_samples/validation')
len(x + x_2)

7478671