# Proof-of-concept: Company name mapping

This Jupyter notebook serves as a fully documented proof-of-concept. 
Different concepts, approaches and solutions are tested, evaluated and documented.


## 1. Presentation of the study case


The case study is quite simple. A list of company names is provided, all of them typed in manually. As a result, some names refer to the same entity but are written differently. For instance, a division name may be written alongside the company name (e.g. ALTRAN INNOVACION for ALTRAN), or a name may be written in a non-standard way, with non-English characters or with common corporate terms (e.g. 'INC.' or '& CO'). The goal is to group the names that belong to the same entity. 

As examples:
- `ALTERYX` is equivalent to `ALTERYX, INC`
- `ALTRAN INNOVACIÃƒÂ“N S.L.` is equivalent to `ALTRAN INNOVACION SOCIEDAD LIMITADA`
- `AMICALE DES ANCIENS DU STADE` is different from `AMICALE JEAN BAPTISTE SALIS`

An algorithm then needs to be designed to efficiently map the names to their common entity, ideally written in a standardized way. Several candidates can be returned for a given name so that the user can easily choose the proper entity name. 

The case study comes with a dataset of 5000 raw company names, but the algorithm is required to be scalable to larger datasets (e.g. 100,000 names). 

## 2. Analyzing the problem

This case study can be treated as a natural language processing (NLP) problem. Unlike many NLP problems, however, it calls for unsupervised machine learning techniques: no "true" answer is given for each company name in the dataset, so one has to find a way to cluster the names based on a string metric that quantifies how likely two names are to come from the same entity. The choice of this string metric is therefore probably as important as the choice of the clustering algorithm. Different metrics can be tested, and one chosen according to the overall performance of the algorithm (accuracy and speed). 
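As a minimal sketch of what such a string metric could look like, the snippet below scores two names with `difflib.SequenceMatcher` from the Python standard library. This is only one candidate, chosen here because it needs no extra dependency; the helper name `name_similarity` is hypothetical and not part of the study.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a similarity score in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


# Two pairs taken from the examples of section 1 (already lowercased by hand)
pairs = [
    ("alteryx", "alteryx inc"),
    ("amicale des anciens du stade", "amicale jean baptiste salis"),
]
for a, b in pairs:
    print(f"{name_similarity(a, b):.2f}  {a!r} vs {b!r}")
```

The first pair scores higher than the second, as hoped. A shared generic prefix such as `amicale` still contributes to the score, which is one reason why several metrics are worth comparing.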

Another important step of the algorithm is the data pre-processing. The raw company names are given in a non-standard way, with many special characters such as non-English characters, commas, dots, quotes, etc. Special care is therefore needed to clean the data and format them in a way that suits the clustering algorithm. One possible step is to remove recurrent terms that do not help the string metric (e.g. 'Inc.'). The list of these terms can be built by finding the most frequent sub-strings in the dataset and manually choosing the ones that can be removed. 

The output of the algorithm needs to let the user choose the correct entity from a list of proposals. This feature will drive the choice of the clustering algorithm. For instance, hierarchical clustering organizes the names into a dendrogram, which makes it possible to have several "layers" of clustering.
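The idea of several "layers" can be illustrated without any clustering library. The sketch below groups names by single linkage at a few similarity thresholds, which corresponds to cutting a single-linkage dendrogram at several heights. It rests on assumptions that are not part of the study: the metric is `difflib.SequenceMatcher`, the thresholds are arbitrary, the helper names are hypothetical, and the all-pairs loop is only meant for a handful of names.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Placeholder metric (assumption): difflib ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def single_linkage_clusters(names, threshold):
    """Merge names connected by a chain of pairs with similarity >= threshold."""
    parent = list(range(len(names)))  # union-find forest, one node per name

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return sorted(clusters.values())


names = ["altran", "altran innovacion", "alteryx", "alteryx inc",
         "amicale jean baptiste salis"]

# One "layer" per threshold, from strict to loose
layers = {t: single_linkage_clusters(names, t) for t in (0.9, 0.7, 0.5)}
for threshold, clusters in layers.items():
    print(threshold, clusters)
```

The strictest layer keeps every name separate, while looser layers return progressively larger groups from which a user could pick. With this toy metric the loosest threshold also merges the `altran` and `alteryx` names, a reminder that the metric and the thresholds both need tuning.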

Evaluating the performance of the algorithm is required. Two metrics will have to be assessed: 
 * The accuracy of the algorithm, i.e. how well it maps each name to the correct entity. Since we don't have the true answer, this will need to be evaluated manually. A possible way would be to pick N (e.g. 20) names at random, define their true entity by hand and check whether it appears in the output of the algorithm. 
 * The speed of the algorithm, and an estimate of the time needed to process a dataset of 100,000 names.
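For the speed estimate, a rough back-of-the-envelope sketch is given below. It assumes that the algorithm compares every pair of names (as a pairwise similarity matrix would) and that the per-pair cost measured on a tiny sample is representative; both are assumptions, and the helper names are hypothetical.

```python
import time
from difflib import SequenceMatcher
from itertools import combinations


def n_pairs(n):
    """Number of distinct pairs among n names: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def seconds_per_pair(sample):
    """Average time of one comparison, measured over all pairs of a small sample."""
    pairs = list(combinations(sample, 2))
    start = time.perf_counter()
    for a, b in pairs:
        SequenceMatcher(None, a, b).ratio()
    return (time.perf_counter() - start) / len(pairs)


sample = ["alteryx", "alteryx inc", "altran", "altran innovacion", "a & l goodbody"]
cost = seconds_per_pair(sample)

for n in (5_000, 100_000):
    hours = cost * n_pairs(n) / 3600
    print(f"{n:>7} names: {n_pairs(n):>13,} pairs, roughly {hours:.1f} h")
```

Going from 5,000 to 100,000 names multiplies the number of pairs by about 400, so an all-pairs approach that is acceptable on the provided dataset may not be on the larger one.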
 
After some research, I found a few solutions that were already proposed for similar problems:
 * [Company Names Standardization using a Fuzzy NLP Approach](https://rajanarya.com/2020/03/28/company-names-standardization-using-a-fuzzy-nlp-approach/), which identifies stop-words (common terms), removes them from the data, builds a pairwise similarity matrix based on fuzzy string matching, clusters the names from this matrix using affinity propagation, and finds the most frequently occurring longest common substring for each cluster. 
 * [Supplier Name Standardization using Unsupervised Learning](https://medium.com/analytics-vidhya/supplier-name-standardization-using-unsupervised-learning-adb27bed9e0d), which uses essentially the same approach as above.
 
While these solutions are inspiring, it is not clear if they are fast enough to be used for a large number of names. 

## 3. Pre-processing the data

Importing the necessary libraries first:

In [168]:
import numpy as np
import pandas as pd
import re
import string
import nltk

Extract the raw name dataset as a pandas Series:

In [147]:
dfs = pd.read_excel("Case_study_names_mapping.xlsx", usecols=[0])
data = dfs['Raw name']
data.head(10)

0          "ACCESOS NORMALIZADOS, SL"
1       "ALTAIX ELECTRONICA , S.A.L."
2      "ANTALA LOCKS & ACCESORIS, SL"
3                       "ANTERAL, SL"
4        "ARQUIMEA INGENIERIA , S.L."
5    A & D ENVIRONMENTAL SERVICES LTD
6                      A & L GOODBODY
7                      A & P - LITHOS
8    A & T STATIONERS PRIVATE LIMITED
9              A A LOGISTIK-EQUIPMENT
Name: Raw name, dtype: object

Some encoding issues can already be found in some of the names: 

In [148]:
data.iloc[1255]

'AHP Gesellschaft fÃ\x83Â¼r Informationsverarbeitung mbH'

This probably comes from non-English characters that were mis-encoded. The data is then Unicode-normalized (NFKD) and reduced to ASCII, which drops every character that has no ASCII equivalent: 

In [149]:
data_clean = data.str.normalize('NFKD').str.encode("ascii",'ignore').str.decode("utf-8","ignore")
data_clean.iloc[1255]

'AHP Gesellschaft fAA14r Informationsverarbeitung mbH'

The mis-encoded character still leaves some residue (`AA14` where the original name had `ü`). This is not perfect, but it will do the job for now.
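One possible way to recover such characters, rather than keeping the residue, is to undo the faulty decoding before the ASCII step. The sketch below assumes that the text was UTF-8 that got decoded as Latin-1, possibly more than once, which matches the sample above but may not hold for every name. The helpers `undo_mojibake` and `to_ascii` are hypothetical and not part of the notebook.

```python
import unicodedata


def undo_mojibake(text, max_rounds=3):
    """Reinterpret Latin-1 code points as UTF-8 bytes for as long as that succeeds."""
    for _ in range(max_rounds):
        try:
            repaired = text.encode("latin-1").decode("utf-8")
        except UnicodeError:
            break  # not (or no longer) mis-decoded UTF-8: keep the text as it is
        if repaired == text:
            break  # pure ASCII: nothing to repair
        text = repaired
    return text


def to_ascii(text):
    """Same normalization as in the notebook: NFKD, then drop non-ASCII characters."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


# Same string as data.iloc[1255], written with escapes to keep this source ASCII-only
raw = "AHP Gesellschaft f\u00c3\u0083\u00c2\u00bcr Informationsverarbeitung mbH"
fixed = undo_mojibake(raw)
print(fixed)
print(to_ascii(fixed))
```

On this sample the repaired name contains `für`, which the ASCII step then reduces to `fur`. Names that are already correct are left untouched, because the UTF-8 decoding step fails and the original text is kept.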

The characters are then converted to lowercase:

In [150]:
data_clean = data_clean.str.lower()
data_clean.iloc[1255]

'ahp gesellschaft faa14r informationsverarbeitung mbh'

Next, the special characters are removed. The list of punctuation characters is taken from `string.punctuation`. The ampersand (`&`) is removed from this list, as it is commonly used in company names (e.g. AT&T) and can be useful information for the clustering algorithm.

In [152]:
punctuation_custom = string.punctuation.replace('&','') # !"#$%'()*+,-./:;<=>?@[\]^_`{|}~
data_clean = data_clean.apply(lambda x: re.sub('[%s]' % re.escape(punctuation_custom), '', x))  # punctuation
data_clean = data_clean.apply(lambda x: re.sub(r'\s+', ' ', x).strip())  # collapse any run of whitespace
data_clean.head(10)

0             accesos normalizados sl
1              altaix electronica sal
2         antala locks & accesoris sl
3                          anteral sl
4              arquimea ingenieria sl
5    a & d environmental services ltd
6                      a & l goodbody
7                        a & p lithos
8    a & t stationers private limited
9               a a logistikequipment
Name: Raw name, dtype: object

To find the stop-words in the raw names, the most frequent words are first listed:

In [181]:
words = data_clean.str.cat(sep=' ').split()
word_dist = nltk.FreqDist(words)

rslt = pd.DataFrame(word_dist.most_common(25),
                    columns=['Word', 'Frequency'])
print(rslt)

            Word  Frequency
0           gmbh        798
1              &        263
2            ltd        234
3             de        162
4             co        140
5            inc        131
6    association        126
7             sl        123
8             kg        114
9             sa        113
10           sas         95
11      autohaus         93
12      services         84
13       limited         80
14         group         71
15        france         66
16             a         61
17  technologies         59
18           und         57
19       systems         57
20     solutions         56
21    consulting         55
22           llc         49
23      advanced         45
24          sarl         44


The stop-words are then chosen manually by inspecting the names that contain these words. For example:

In [224]:
data_clean[data_clean.str.contains("kg")].head(20)

52                 a&e logistik gmbh & co kg
74                    as frucht gmbh & co kg
296            absolut karriere gmbh & co kg
303                       abtec gmbh & co kg
357                     accemic gmbh & co kg
512                        acsg gmbh & co kg
568                  active blue gmbh & cokg
569                  active blue gmbh & cokg
621           adalbert zajadacz gmbh & co kg
624     adam keller baugeschaft gmbh & co kg
625    adam keller baugeschaeft gmbh & co kg
627               adams consult gmbh & co kg
645                         adc gmbh & co kg
720               adex beratungs gmbh & cokg
789                    adolf lupp gmbh co kg
790               adolf mueller gmbh & co kg
792                    adolf reiss & sohn kg
794                 adolf warth gmbh & co kg
796                  adolf wrth gmbh & co kg
797                adolf wuerth gmbh & co kg
Name: Raw name, dtype: object

In [226]:
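stop_words = {"gmbh", "co", "kg", "cokg", "ltd", "limited", "sl", "inc", "sa", "sarl", "sas", "llc", "dba"}  # "& co" left out: it can never equal a single token
data_clean = data_clean.apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))
data_clean = data_clean.str.replace(r'(\s*&)+\s*$', '', regex=True)  # drop the "&" left dangling by e.g. "gmbh & co kg"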
data_clean.head(20)

0             accesos normalizados
1           altaix electronica sal
2         antala locks & accesoris
3                          anteral
4              arquimea ingenieria
5     a & d environmental services
6                   a & l goodbody
7                     a & p lithos
8         a & t stationers private
9            a a logistikequipment
10                a a z consulting
11                a and m portable
12                      a arnegger
13                 a b fluid power
14                a bis z allround
15                     a e petsche
16                  a e petsche uk
17                   a et p lithos
18              a foubert visserie
19                   a joy wallace
Name: Raw name, dtype: object

The country names (e.g. `france`) could possibly be removed, but some of the entity names contain the country (e.g. `AGENCE FRANCE-PRESSE`). 

In the future, some tests could be done with a more aggressive stop-word removal (e.g. also removing other common words such as `technologies`, `services`, `systems`, `solutions`, `consulting`, etc.). 
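The pre-processing steps of this section can also be gathered into a single function that works on one string, which makes them easier to test and to reuse on a larger dataset. This is only a sketch: the name `clean_name` is hypothetical, the stop-word list is the one chosen above, and dropping a trailing `&` is an extra assumption meant to handle names such as `... gmbh & co kg`.

```python
import re
import string
import unicodedata

STOP_WORDS = {"gmbh", "co", "kg", "cokg", "ltd", "limited", "sl", "inc",
              "sa", "sarl", "sas", "llc", "dba"}
# All punctuation except the ampersand, which is kept on purpose
PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation.replace("&", "")))


def clean_name(raw):
    """Apply the steps of this section to one raw company name."""
    # 1. Unicode-normalize and drop the characters with no ASCII equivalent
    text = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    # 2. Lowercase and remove punctuation
    text = PUNCT_RE.sub("", text.lower())
    # 3. Remove stop-words (splitting also collapses repeated spaces)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    # 4. Assumption: a trailing "&" is a leftover of "& co" and carries no information
    while tokens and tokens[-1] == "&":
        tokens.pop()
    return " ".join(tokens)


for raw in ['"ACCESOS NORMALIZADOS, SL"', "ALTERYX, INC", "ABTEC GmbH & Co. KG"]:
    print(repr(raw), "->", repr(clean_name(raw)))
```

With pandas, such a function could be applied to the whole column through `data.apply(clean_name)`.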

## 4. Finding a good name-similarity metric

## 5. Clustering the names 

## 6. Presenting the result

## 7. Evaluating the performance