# cat-AI-log. An AI-based product group allocation system

Capstone project.

Sebastian Thomas @ neue fische Bootcamp Data Science<br />
(datascience at sebastianthomas dot de)

# Part 2: Data preprocessing

We clean the data and engineer some new features.

## Imports

### Modules, classes and functions

In [None]:
# python object persistence
import joblib

# data
import numpy as np
import pandas as pd

# machine learning
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# custom modules
from modules.ds import data_type_info
from transformer.cleaning import clean_mira
from transformer.feature_engineering import extract_features_mira, reduce_to_base

### Data

We import our data.

In [None]:
mira = pd.read_pickle('data/mira_1.pickle')
mira.sample(5, random_state=0)

## Data cleaning

### Removing strange initial and final characters

Some values of the feature `'article'` begin with `'!'`, or begin or end with `'*'`.

In [None]:
print(mira.loc[118, 'article'])
print(mira.loc[423, 'article'])
print(mira.loc[549, 'article'])

To clean the feature `'article'`, we remove these characters (below).
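
As a rough illustration of this step, the following minimal sketch strips the marker characters with a regular expression. The function name `strip_marker_characters` and the sample strings are hypothetical; the actual logic lives in `clean_mira`, which is not shown here and may differ.

```python
import re


def strip_marker_characters(article):
    """Remove leading '!' or '*' and trailing '*' (plus surrounding whitespace).

    Hypothetical helper for illustration; not the implementation of clean_mira.
    """
    return re.sub(r'^[!*\s]+|[*\s]+$', '', article)


# invented sample values
examples = ['!Beispiel', '*Beispiel*', 'Beispiel!', 'A*B']
stripped = [strip_marker_characters(example) for example in examples]
```

Characters inside a value, and a `'!'` at the end, are left untouched, since the text above only mentions `'!'` at the beginning.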

### Replacing tokens

The feature `'article'` contains many abbreviations.

In [None]:
print(mira.loc[51289, 'article'])
print(mira.loc[126987, 'article'])
print(mira.loc[542279, 'article'])
print(mira.loc[8498, 'article'])

To expand abbreviations (e.g. replace `'Beatm Gerät'` by `'Beatmungsgerät'`) or to replace them by an official dosage form abbreviation that can be recognized later (e.g. replace `'Au./Ohr. Tr.'` by `'ATO'`), we use a manually created CSV file. We import this file into a dataframe and transform it into a dictionary whose keys are the lowercased abbreviations without a trailing dot.

In [None]:
abbreviations_df = pd.read_csv('data/abbreviations.csv', sep=';')
abbreviations_df.sample(5, random_state=0)

In [None]:
replacements_abbreviations = pd.Series(
    abbreviations_df['expansion'].values,
    # normalize the keys: drop a trailing dot (regex), then lowercase
    index=abbreviations_df['abbreviation'].str.replace(r'\.$', '', regex=True).str.lower()
).to_dict()

To replace an unofficial dosage form abbreviation (as listed on a DocMorris website) by the official abbreviation (of the Informationsstelle für Arzneispezialitäten, IFA), we use another CSV file, which was created from the abbreviation files of DocMorris (DM) and IFA.

In [None]:
replacements_dm_ifa_df = pd.read_csv('data/replacements_dm_ifa.csv', sep=';')
replacements_dm_ifa_df.sample(5, random_state=0)

In [None]:
replacements_dm_ifa = pd.Series(
    replacements_dm_ifa_df['abbreviation_ifa'].values,
    # normalize the keys: drop a trailing dot (regex), then lowercase
    index=replacements_dm_ifa_df['abbreviation_dm'].str.replace(r'\.$', '', regex=True).str.lower()
).to_dict()

To replace the full spelling of a dosage form by its IFA abbreviation, we use a third CSV file.

In [None]:
dosage_forms_ifa = pd.read_csv('data/dosage_forms_ifa.csv', sep=';')
dosage_forms_ifa.sample(5, random_state=0)

In [None]:
replacements_dosage_form = pd.Series(
    dosage_forms_ifa['abbreviation'].values,
    # normalize the keys: drop a trailing dot (regex), then lowercase
    index=dosage_forms_ifa['dosage form'].str.replace(r'\.$', '', regex=True).str.lower()
).to_dict()
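
To make the role of these dictionaries more concrete, here is a minimal sketch of how such a dictionary could be applied to a string. It assumes that replacement is case-insensitive, works on whole phrases and tolerates a trailing dot, which mirrors the key normalization above. The helpers `normalize_key` and `build_replacer` are hypothetical, and the second sample string is invented; `clean_mira` may work differently.

```python
import re


def normalize_key(abbreviation):
    """Drop one trailing dot and lowercase, as done for the dictionary keys above."""
    return re.sub(r'\.$', '', abbreviation).lower()


def build_replacer(raw_replacements):
    """Return a function that replaces every known abbreviation in a string.

    Hypothetical helper for illustration; not the implementation of clean_mira.
    """
    replacements = {normalize_key(key): value for key, value in raw_replacements.items()}
    # longest keys first, so that a long phrase wins over a shorter prefix
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r'(?<!\w)(' + '|'.join(re.escape(key) for key in keys) + r')\.?(?!\w)',
        flags=re.IGNORECASE,
    )

    def replace(text):
        return pattern.sub(lambda match: replacements[match.group(1).lower()], text)

    return replace


# the two abbreviation pairs mentioned in the text above
replace = build_replacer({'Beatm Gerät': 'Beatmungsgerät', 'Au./Ohr. Tr.': 'ATO'})
expanded = replace('Beatm Gerät')
abbreviated = replace('Beispiel Au./Ohr. Tr. 10 ml')
```

The lookarounds `(?<!\w)` and `(?!\w)` keep the pattern from matching inside a longer word.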

### Cleaner

We clean the feature `'article'` (engineering a feature `'article cleaned'`), using a predefined function `clean_mira`.

In [None]:
replacement_dicts = [replacements_abbreviations, replacements_dm_ifa, replacements_dosage_form]
cleaner = FunctionTransformer(clean_mira, kw_args={'replacement_dicts': replacement_dicts})

mira['article cleaned'] = cleaner.transform(mira['article'])
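
The following toy sketch illustrates what the wrapper does: `FunctionTransformer.transform(X)` simply calls the wrapped function as `func(X, **kw_args)`. The function `replace_words` is a hypothetical stand-in for `clean_mira`, and the two toy dictionaries are invented. They also show why the order of the dictionaries matters, since a later dictionary sees the output of an earlier one.

```python
from sklearn.preprocessing import FunctionTransformer


def replace_words(texts, replacement_dicts):
    """Hypothetical stand-in for clean_mira: replace whole words, one dictionary after another."""
    cleaned = []
    for text in texts:
        words = text.split()
        for replacements in replacement_dicts:
            words = [replacements.get(word.lower(), word) for word in words]
        cleaned.append(' '.join(words))
    return cleaned


# invented toy dictionaries: expand an abbreviation first, then map the full spelling to a short form
toy_dicts = [{'tabl': 'Tabletten'}, {'tabletten': 'TAB'}]
toy_cleaner = FunctionTransformer(replace_words, kw_args={'replacement_dicts': toy_dicts})

via_transformer = toy_cleaner.transform(['Beispiel Tabl'])
via_function = replace_words(['Beispiel Tabl'], replacement_dicts=toy_dicts)
```

Wrapping the function this way lets it be combined with other steps in a scikit-learn pipeline later on.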

## Feature engineering

### Extraction of tokens

Some values of the feature `'article cleaned'` contain dosage forms, manufacturers, or a note on laws.

In [None]:
print(mira.loc[15, 'article cleaned'])
print(mira.loc[210, 'article cleaned'])
print(mira.loc[2, 'article cleaned'])

To extract these tokens, we use lists and pandas Series.

In [None]:
manufacturers = pd.read_csv('data/manufacturers.csv', sep=';')
laws = ['11.3', '73.3', '116', '116b', '129', '129a']
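
A minimal sketch of such a token extraction is given below. It assumes that tokens are separated by whitespace and that the first matching token per list is kept. The function `extract_tokens`, the manufacturer name and the sample strings are hypothetical; `extract_features_mira` may handle multi-word tokens and other cases differently.

```python
def extract_tokens(text, token_lists, feature_names):
    """Return the first token from each list that occurs as a whole word in the text.

    Hypothetical helper for illustration; not the implementation of extract_features_mira.
    """
    words = text.split()
    features = {}
    for name, tokens in zip(feature_names, token_lists):
        allowed = set(tokens)
        features[name] = next((word for word in words if word in allowed), None)
    return features


toy_dosage_forms = ['ATO', 'TAB']           # illustrative subset
toy_manufacturers = ['Musterpharma']        # invented name
laws = ['11.3', '73.3', '116', '116b', '129', '129a']

features = extract_tokens('Beispiel ATO Musterpharma 129a',
                          [toy_dosage_forms, toy_manufacturers, laws],
                          ['dosage form', 'manufacturer', 'law'])
```

Matching whole words matters here, because otherwise `'116'` would also match inside `'116b'`.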

### Extraction of manufacturer article numbers

Some values of the feature `'article cleaned'` contain a manufacturer article number at the end.

In [None]:
print(mira.loc[396091, 'article cleaned'])

We extract these manufacturer article numbers (below).

### Extraction of information on additional fee

Some values of the feature `'article cleaned'` contain information on an additional fee ("Zusatzentgelt") at the end.

In [None]:
print(mira.loc[124, 'article cleaned'])

### Extraction of information on treatment

Some values of the feature `'article cleaned'` contain information on outpatient ("ambulant") and/or inpatient ("stationär") treatment.

In [None]:
print(mira.loc[0, 'article cleaned'])
print(mira.loc[2, 'article cleaned'])

### Extraction of physical quantities

Some values of the feature `'article cleaned'` contain physical quantities.

In [None]:
print(mira.loc[1, 'article cleaned'])  # mass, volume, mass concentration
print(mira.loc[3, 'article cleaned'])  # percentage
print(mira.loc[43, 'article cleaned'])  # count puffs
print(mira.loc[77, 'article cleaned'])  # active ingredient percentage
print(mira.loc[594, 'article cleaned'])  # mass flow
print(mira.loc[2198, 'article cleaned'])  # count
print(mira.loc[2462, 'article cleaned'])  # length
print(mira.loc[118490, 'article cleaned'])  # volume flow
print(mira.loc[573428, 'article cleaned'])  # mass puff concentration

We extract some of these quantities (below).
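
As an illustration, the sketch below extracts a mass, a volume and a percentage with regular expressions. It assumes lowercase units, an optional decimal comma or dot, and that a unit followed by `/` belongs to a concentration and should not count as a plain mass or volume. The name `extract_quantities`, the unit table and the sample strings are hypothetical; the project's own extraction covers more quantities and may use different patterns.

```python
import re

# illustrative unit table (assumption, not taken from the project)
UNITS = {
    'mass': ('mg', 'g', 'kg'),
    'volume': ('ml', 'l'),
    'percentage': ('%',),
}


def extract_quantities(text):
    """Return {quantity name: (value, unit)} for the first match of each quantity.

    Hypothetical helper for illustration only.
    """
    result = {}
    for name, units in UNITS.items():
        # longer units first, so that 'mg' is tried before 'g'
        unit_pattern = '|'.join(re.escape(unit) for unit in sorted(units, key=len, reverse=True))
        match = re.search(
            r'(?<![\w,.])(\d+(?:[.,]\d+)?)\s*(' + unit_pattern + r')(?![\w/])',
            text,
        )
        if match:
            value = float(match.group(1).replace(',', '.'))
            result[name] = (value, match.group(2))
    return result


# invented sample values
plain = extract_quantities('Beispiel 2,5 mg 10 ml')
percent = extract_quantities('Beispiel 0,9 %')
concentration = extract_quantities('Beispiel 5 mg/ml')
```

In this sketch the concentration string yields no plain mass or volume; a separate pattern would be needed to capture `mg/ml` itself.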

### Feature engineerer

We engineer the mentioned features. Moreover, we reduce the feature `'article cleaned'` by removing the mentioned strings, engineering a feature `'article base'`.

In [None]:
token_lists = [dosage_forms_ifa['abbreviation'], manufacturers['manufacturer'], laws]
token_feature_names = ['dosage form', 'manufacturer', 'law']

mira = pd.concat([mira, extract_features_mira(mira['article cleaned'], token_lists, token_feature_names)],
                 axis=1)

In [None]:
reducer = FunctionTransformer(reduce_to_base, kw_args={'token_lists': token_lists})

mira['article base'] = reducer.transform(mira['article cleaned'])
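
Conceptually, the reduction is the counterpart of the token extraction: everything that was recognized is dropped and the remainder is kept. The sketch below shows this for whitespace-separated tokens only. The function `reduce_to_base_sketch`, the manufacturer name and the sample string are hypothetical; the real `reduce_to_base` also removes the other strings mentioned above.

```python
def reduce_to_base_sketch(text, token_lists):
    """Remove every whole word that occurs in one of the token lists.

    Hypothetical helper for illustration; not the implementation of reduce_to_base.
    """
    tokens_to_remove = {token for tokens in token_lists for token in tokens}
    kept = [word for word in text.split() if word not in tokens_to_remove]
    return ' '.join(kept)


toy_token_lists = [
    ['ATO', 'TAB'],                                   # illustrative dosage forms
    ['Musterpharma'],                                 # invented manufacturer
    ['11.3', '73.3', '116', '116b', '129', '129a'],   # laws, as defined above
]
base = reduce_to_base_sketch('Beispiel ATO Musterpharma 129a', toy_token_lists)
```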

## Summary

In [None]:
mira.sample(5, random_state=0)

In [None]:
data_type_info(mira)

## Save data set

We save the preprocessed data set.

In [None]:
mira.to_pickle('data/mira_2.pickle')

## Save preprocessor

We construct a preprocessor object and save it for later use in the web app.

In [None]:
preprocessor = make_pipeline(cleaner, reducer)

joblib.dump(preprocessor, 'objects/preprocessor.joblib');