# cat-AI-log. An AI-based product group allocation system

Capstone project.

Sebastian Thomas @ neue fische Bootcamp Data Science<br />
(datascience at sebastianthomas dot de)

When ordering medicines, hospitals have to deal with a multitude of different article descriptions for identical products. With cat-AI-log, article duplicates and similar articles can be found and articles can be catalogued in product groups using human-assisted artificial intelligence.

![cat-AI-log presentation][photo]

[photo]: cat-AI-log.png "cat-AI-log presentation"

## Data origin
The data was provided to me for my final capstone project at the neue fische Bootcamp Data Science. It belongs to a local consulting company with which I cooperated.

## Original features

The instances represent article orders. The given features are as follows:

| feature                   | description                                   | type                  |
|:--------------------------|:----------------------------------------------|:----------------------|
| `ArtikelBezeichnung`      | article description                           | text                  |
| `Menge`                   | unclear, has strange values                   | continuous (numeric)  |
| `WarengruppeBezeichnung`  | product group description, **used as target** | text                  |
| `Mengeneinheit`           | unit of feature `Menge`                       | nominal (categorical) |
| `Datenursprung`           | anonymised origin of instance                 | nominal (categorical) |

# Part 1: Data mining

We import the data set, explore it briefly, drop duplicates and unused features and handle the data types.

## Imports

### Modules, classes and functions

In [None]:
# data
import pandas as pd

# custom modules
from modules.ds import data_type_info, handle_data_types

### Data

We load our data, which is available in a single CSV file, into a dataframe.

In [None]:
mira = pd.read_csv('data/mira_raw.csv', sep=';')
mira
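
As a small illustration of reading a semicolon-separated file, the following sketch parses an in-memory toy CSV with the same column names as the data set. The two rows and all their values are made-up placeholders, not taken from the real data.

```python
import io

import pandas as pd

# Toy stand-in for 'data/mira_raw.csv'; the values are illustrative only.
raw = io.StringIO(
    "ArtikelBezeichnung;Menge;WarengruppeBezeichnung;Mengeneinheit;Datenursprung\n"
    "Aspirin 500 mg;10;Analgetika;ST;A\n"
    "Ibuprofen 400 mg;;Analgetika;;B\n"
)

# sep=';' tells pandas that fields are separated by semicolons, not commas.
toy = pd.read_csv(raw, sep=';')
print(toy.shape)
print(toy.isna().sum())
```

Empty fields between two separators are read as NA values, which is why missing quantities and units show up as NA in the exploration below.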

## First exploration

We explore the number of instances and features.

In [None]:
mira.shape

The dataframe `mira` has 1024619 rows and 5 columns, i.e. we have 1024619 instances, 1 target (`'WarengruppeBezeichnung'`) and 4 features.

We explore the current data types, the number of unique values and the number of NA values of all features.

In [None]:
data_type_info(mira)

Across the 1024619 instances, the feature `'ArtikelBezeichnung'` has only 28018 unique values. The features `'Menge'` and `'Mengeneinheit'` have a large number of NA values.
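
The helper `data_type_info` lives in a custom module that is not shown here. A comparable overview can be sketched with plain pandas; the function `summarize_columns` below is a hypothetical stand-in, not the actual implementation, and the toy data is made up.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: dtype, unique count and NA count per column."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'unique': df.nunique(),   # NA values are not counted as a unique value
        'na': df.isna().sum(),
    })


# Toy data, illustrative only.
toy = pd.DataFrame({
    'ArtikelBezeichnung': ['Aspirin 500 mg', 'Aspirin 500 mg', 'Ibuprofen 400 mg'],
    'Menge': [1.0, None, None],
})
summary = summarize_columns(toy)
print(summary)
```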

## Dropping features and instances

In our further exploration, we will not use the features `'Menge'`, `'Mengeneinheit'` and `'Datenursprung'`, so we drop them from our dataframe. Moreover, we drop duplicate instances.

In [None]:
mira.drop(['Menge', 'Mengeneinheit', 'Datenursprung'], axis=1, inplace=True)
mira.drop_duplicates(inplace=True)
mira.shape

There remain 28020 instances.
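
To see why dropping features before dropping duplicates matters, here is a minimal sketch on made-up toy data. It uses the non-inplace form of the same pandas calls; the frame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy data, illustrative only: four order rows, two distinct articles.
toy = pd.DataFrame({
    'ArtikelBezeichnung': ['A', 'A', 'B', 'B'],
    'WarengruppeBezeichnung': ['G1', 'G1', 'G1', 'G2'],
    'Menge': [1.0, 2.0, 3.0, 4.0],
})

# With 'Menge' present, all four rows differ; without it, the two 'A' rows coincide.
reduced = toy.drop(columns=['Menge']).drop_duplicates()
print(reduced)
```

In this toy frame, article `'B'` survives twice because it is assigned to two different product groups. Rows of this kind are not removed by `drop_duplicates` and have to be inspected separately.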

We inspect the instances that share the same value of the feature `'ArtikelBezeichnung'`.

In [None]:
mira[mira.duplicated(subset=['ArtikelBezeichnung'], keep=False)]

We remove the instances with index 19 and 2172, so that each article description occurs only once.

In [None]:
mira.drop([19, 2172], inplace=True)
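
The duplicate check and the removal by index label can be tried on a toy frame. This is a sketch under assumptions: the data and the choice of which row to drop are made up, and the criterion used for the real data is not stated here.

```python
import pandas as pd

# Toy data, illustrative only; the index labels are arbitrary.
toy = pd.DataFrame(
    {
        'ArtikelBezeichnung': ['A', 'B', 'B', 'C'],
        'WarengruppeBezeichnung': ['G1', 'G1', 'G2', 'G3'],
    },
    index=[10, 11, 12, 13],
)

# keep=False marks every member of a duplicate group, not only the later ones.
conflicts = toy[toy.duplicated(subset=['ArtikelBezeichnung'], keep=False)]
print(conflicts)

# drop() removes rows by index label, not by position.
cleaned = toy.drop([12])
print(cleaned['ArtikelBezeichnung'].is_unique)
```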

## Renaming features
We rename the features to English.

In [None]:
mira.rename({'ArtikelBezeichnung': 'article', 'WarengruppeBezeichnung': 'product group'},
            axis=1, inplace=True)

## Handling data types

We cast the data types from `object` to `string`.

In [None]:
handle_data_types(mira, string_features=mira.columns)
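
The helper `handle_data_types` is part of a custom module that is not shown here. Assuming it performs a cast to the pandas `string` extension dtype, an equivalent step in plain pandas could look like the following sketch on toy data.

```python
import pandas as pd

# Toy data, illustrative only.
toy = pd.DataFrame({'article': ['A', 'B'], 'product group': ['G1', 'G2']})

# Cast every column to the pandas string extension dtype.
converted = toy.astype('string')
print(converted.dtypes)
```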

## Summary

In [None]:
mira.sample(5, random_state=0)

In [None]:
data_type_info(mira)

## Save data set

In [None]:
mira.to_pickle('data/mira_1.pickle')
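
A pickle file keeps the data types of the dataframe, whereas a CSV file would store plain text only. The following sketch checks this round trip on toy data in a temporary directory; the file name `toy.pickle` is an assumption for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data, illustrative only, already cast to the string dtype.
toy = pd.DataFrame({'article': ['A', 'B'], 'product group': ['G1', 'G2']}).astype('string')

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'toy.pickle'
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Values and dtypes survive the round trip.
print(restored.dtypes)
```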