The entity matching model is designed to predict the similarity between pairs of entity names, specifically company names. It's based on the DistilBERT transformer model and deployed as an API for easy integration into various applications.
- Company names from different data sources don't always match because of differences in word order, spacing and special characters, abbreviations, typos, changes in company type (e.g. GmbH -> AG), prefixes and suffixes, and so on.
- This is a common data science problem.
- The data contains a large list of companies with both positive and negative matches.
- The initial dataset contains 7,042,846 labeled entity pairs. A small sample from this dataset is available in the data_sample.csv file.
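To illustrate why naive comparison falls short on such name variations, here is a minimal sketch using only the Python standard library. It is not how the model works: `exact_match` and `char_similarity` are hypothetical helpers, and the second name pair is made up for illustration (the first is taken from the example request in this README).

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """Case-insensitive equality after trimming surrounding whitespace."""
    return a.strip().lower() == b.strip().lower()


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib (not the model's score)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Illustrative pairs only; the second pair is a made-up company-type change.
pairs = [
    ("Example Company A", "Example Co. A"),
    ("Example GmbH", "Example AG"),
]

for left, right in pairs:
    print(left, "|", right, "|", exact_match(left, right), "|",
          round(char_similarity(left, right), 2))
```

Exact matching rejects both pairs, and a raw character ratio gives no principled threshold for deciding a match, which is the gap a learned similarity model is meant to fill.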
- Docker
- Python 3.8 (if running locally without Docker)
- Building the Docker Image
    docker build -t entity_matcher .
- Running the Docker Container
    docker run -p 80:80 entity_matcher
- (OR) Running without Docker Container
    # Create a virtual environment (optional but recommended)
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    # Install dependencies
    pip install -r requirements.txt
    cd app
    uvicorn main:app --reload
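When run through Docker with the port mapping above, the API will be accessible at http://localhost (port 80). When started directly with uvicorn, it listens on http://127.0.0.1:8000 by default.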
You can make POST requests to the /predict endpoint with a pair of entity names. The response will include the prediction and the associated probability.
Example Request
POST /predict
Content-Type: application/json
{
  "entity_1": "Example Company A",
  "entity_2": "Example Co. A"
}
Example Response
{
  "prediction": 1,
  "probability": 0.999990
}
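The request above could be sent from Python roughly as follows. This is a minimal sketch using only the standard library. The helper names `build_predict_request` and `predict` are hypothetical, and the default `BASE_URL` assumes the Docker setup with port 80 mapped; pass a different `base_url` (for example `http://127.0.0.1:8000` for a default uvicorn run) if your setup differs.

```python
import json
import urllib.request

# Assumption: the container was started with `-p 80:80`, so no port is needed.
BASE_URL = "http://localhost"


def build_predict_request(entity_1: str, entity_2: str,
                          base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /predict with the JSON body shown above."""
    payload = json.dumps({"entity_1": entity_1, "entity_2": entity_2})
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(entity_1: str, entity_2: str,
            base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """Send the request and return the decoded JSON response as a dict."""
    request = build_predict_request(entity_1, entity_2, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `predict("Example Company A", "Example Co. A")` should return a dict with the `prediction` and `probability` keys shown in the example response.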