Serggorsar/entity_matcher

Entity Matching Model For Company Names

Overview

The entity matching model is designed to predict the similarity between pairs of entity names, specifically company names. It's based on the DistilBERT transformer model and deployed as an API for easy integration into various applications.

The Challenge

  • Company names from different data sources don’t always match because of differing word order, spacing and special characters, abbreviations, typos, changes in company type (GmbH -> AG), prefixes and suffixes, and so on.
  • This is a common data science problem: some examples here and here
  • The data contains a large list of company name pairs, labeled as positive or negative matches.
  • The initial dataset contains 7,042,846 labeled entity pairs. A small sample from this dataset is available in the data_sample.csv file.
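
To make the variations listed above concrete, here is a minimal rule-based baseline. It is an illustration only and not the method used in this repository: the `LEGAL_FORMS` set and both function names are hypothetical. It handles case, punctuation, word order, and company-type changes, but fails on abbreviations and typos, which is the gap a learned model is meant to cover.

```python
import re

# Hypothetical list of legal-form tokens to ignore (an assumption, not from the repo).
LEGAL_FORMS = {"gmbh", "ag", "inc", "ltd", "llc", "co", "corp"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, drop legal forms, and ignore word order."""
    # Replace every run of characters other than a-z and 0-9 with a single space.
    tokens = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    # Drop legal-form tokens and sort so that word order does not matter.
    kept = sorted(t for t in tokens if t not in LEGAL_FORMS)
    return " ".join(kept)


def naive_match(name_a: str, name_b: str) -> bool:
    """Return True when both names normalize to the same string."""
    return normalize(name_a) == normalize(name_b)


# Company-type change and special characters: handled by the rules.
print(naive_match("Acme Trading GmbH", "ACME-Trading AG"))
# Abbreviation ("Company" vs "Co."): the rules miss this pair.
print(naive_match("Example Company A", "Example Co. A"))
# Typo: the rules miss this pair as well.
print(naive_match("Acme Trading", "Acme Tradng"))
```

Exact matching after normalization is deliberately strict here; each new kind of variation needs another hand-written rule, which does not scale to millions of pairs.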

Getting Started

Prerequisites

  • Docker
  • Python 3.8 (if running locally without Docker)

Running

  1. Building the Docker Image

    docker build -t entity_matcher .
    
  2. Running the Docker Container

    docker run -p 80:80 entity_matcher
    
  3. (Or) Running Locally without Docker

    # Create a virtual environment (optional but recommended)
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    
    # Install dependencies
    pip install -r requirements.txt
    
    cd app
    uvicorn main:app --reload
    

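When run with Docker as above, the API is accessible at http://localhost (port 80). When run locally with uvicorn, it listens on http://127.0.0.1:8000 by default.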

Usage

You can make POST requests to the /predict endpoint with a pair of entity names. The response will include the prediction and the associated probability.

Example Request

POST /predict
Content-Type: application/json

{
  "entity_1": "Example Company A",
  "entity_2": "Example Co. A"
}
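
The request above could be sent from Python roughly as follows. This is a sketch that uses only the standard library; the helper names `build_request` and `predict` are hypothetical, and the default base URL assumes the Docker setup on port 80.

```python
import json
import urllib.request


def build_request(entity_1: str, entity_2: str,
                  base_url: str = "http://localhost") -> urllib.request.Request:
    """Build the POST /predict request without sending it."""
    payload = json.dumps({"entity_1": entity_1, "entity_2": entity_2}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(entity_1: str, entity_2: str,
            base_url: str = "http://localhost", timeout: float = 10.0) -> dict:
    """Send the request and return the decoded JSON body (needs a running API)."""
    req = build_request(entity_1, entity_2, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Example call, only meaningful while the service is running:
# result = predict("Example Company A", "Example Co. A")
```

For a local uvicorn run, pass `base_url="http://127.0.0.1:8000"` instead of the default.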

Example Response

{
  "prediction": 1,
  "probability": 0.999990
}
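
A response of this shape could be decoded on the client side as sketched below. The `MatchResult` type and `parse_response` helper are hypothetical, and treating `prediction == 1` as "match" is an assumption based on the example above, not documented behavior.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchResult:
    prediction: int      # assumed: 1 means the names match, 0 means they do not
    probability: float   # probability reported by the API

    @property
    def is_match(self) -> bool:
        return self.prediction == 1


def parse_response(body: str) -> MatchResult:
    """Decode a /predict JSON body into a typed result."""
    data = json.loads(body)
    return MatchResult(prediction=int(data["prediction"]),
                       probability=float(data["probability"]))


result = parse_response('{"prediction": 1, "probability": 0.999990}')
print(result.is_match, result.probability)
```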

About

Project for entity matching of company names. Includes DistilBERT transformer-based model training, analytics, and FastAPI application creation.
