Entity Matching Model For Company Names

Overview

The entity matching model is designed to predict the similarity between pairs of entity names, specifically company names. It's based on the DistilBERT transformer model and deployed as an API for easy integration into various applications.

The Challenge

Company names from different data sources often fail to match exactly because of differences in word order, spacing and special characters, abbreviations, typos, changes in company type (GmbH -> AG), prefixes and suffixes, and so on.
This is a common data science problem.
The data contains a large list of company name pairs, covering both positive (matching) and negative (non-matching) examples.
The initial dataset contains 7,042,846 labeled entity pairs. A small sample from this dataset is available in the data_sample.csv file.
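
To make the problem concrete, here is a minimal rule-based baseline, written only for illustration. It is not the method used in this repository; the helper names `normalize` and `baseline_similarity` are hypothetical. It handles word order, casing, and punctuation, but it has no notion of abbreviations or company-type changes, which is the kind of gap a learned model is meant to close.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and sort the tokens."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(sorted(tokens))


def baseline_similarity(name_a: str, name_b: str) -> float:
    """Character-level similarity in [0, 1] between the normalized names."""
    return SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()


# Word order, casing, and extra spaces are absorbed by normalization.
print(baseline_similarity("Siemens AG", "ag   SIEMENS"))

# A change of company type still lowers the score, even though
# both names may refer to the same company.
print(baseline_similarity("Example GmbH", "Example AG"))
```

A fixed threshold on such a score tends to either miss legitimate variations or accept unrelated names that happen to share characters.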

Getting Started

Prerequisites

Docker
Python 3.8 (if running locally without Docker)

Running

Building the Docker Image
```
docker build -t entity_matcher .
```
Running the Docker Container
```
docker run -p 80:80 entity_matcher
```

Alternatively: Running without Docker

```
# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

# Install dependencies
pip install -r requirements.txt

# Start the API server from the app directory
cd app
uvicorn main:app --reload
```

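With Docker, the API will now be accessible at http://localhost (port 80). When started with the uvicorn command above, it listens on uvicorn's default address, http://127.0.0.1:8000.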

Usage

You can make POST requests to the /predict endpoint with a pair of entity names. The response will include the prediction and the associated probability.

Example Request

```
POST /predict
Content-Type: application/json

{
  "entity_1": "Example Company A",
  "entity_2": "Example Co. A"
}
```

Example Response

```
{
  "prediction": 1,
  "probability": 0.999990
}
```
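
The request above can also be sent from Python. The following is a minimal client sketch using only the standard library. The function names `build_predict_request` and `predict` are hypothetical, and the default `base_url` assumes the Docker setup on port 80; pass `http://127.0.0.1:8000` if the server was started with uvicorn directly.

```python
import json
from urllib import request


def build_predict_request(entity_1: str, entity_2: str,
                          base_url: str = "http://localhost") -> request.Request:
    """Build a POST request for the /predict endpoint without sending it."""
    payload = json.dumps({"entity_1": entity_1, "entity_2": entity_2})
    return request.Request(
        url=f"{base_url}/predict",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(entity_1: str, entity_2: str,
            base_url: str = "http://localhost", timeout: float = 10.0) -> dict:
    """Send the pair to the API and return the decoded JSON response."""
    req = build_predict_request(entity_1, entity_2, base_url)
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `predict("Example Company A", "Example Co. A")` should return a dictionary with the `prediction` and `probability` fields shown in the example response.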
