This script processes a dataset of company information and addresses to standardize text fields, geocode addresses, and identify similar companies based on names and locations. Two variations of the algorithm are included, leveraging different approaches to geocoding and similarity matching.
- Standardization:
  - Converts company names and addresses to uppercase, removes extra spaces, and trims unnecessary characters.
- Geocoding:
  - Variation 1: Uses the `Nominatim` geocoder with multiprocessing for parallel geocoding.
  - Variation 2: Implements asynchronous geocoding using `aiohttp` for faster processing.
- Similarity Matching:
  - Compares company names using fuzzy matching algorithms to determine similarity.
  - Groups companies within a specified distance threshold.
- Distance Filtering:
  - Filters companies based on geographic proximity using geodesic distance calculations (default: 50 miles).
- Low Similarity Tracking:
  - Identifies and separates records with low similarity scores for further review.

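The standardization, name comparison, and distance grouping steps can be sketched with the standard library alone. This is a minimal illustration, not the script's implementation: `difflib.SequenceMatcher` stands in for the fuzzy matching library, a haversine (great-circle) formula stands in for the geodesic calculation, and the function names, the set of trimmed characters, and the greedy grouping rule are all assumptions.

```python
import math
import re
from difflib import SequenceMatcher

DISTANCE_THRESHOLD_MILES = 50  # default stated in the feature list


def standardize(text: str) -> str:
    """Uppercase, collapse whitespace runs, and trim edge punctuation."""
    collapsed = re.sub(r"\s+", " ", text)
    # Assumption: "unnecessary characters" means spaces and . , ; at the edges.
    return collapsed.strip(" .,;").upper()


def name_similarity(a: str, b: str) -> float:
    """Score two names from 0 to 100 (difflib stands in for the fuzzy matcher)."""
    return 100 * SequenceMatcher(None, standardize(a), standardize(b)).ratio()


def miles_between(p: tuple, q: tuple) -> float:
    """Haversine distance in miles between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(h))  # 3958.8 = mean Earth radius, miles


def assign_location_index(points: list) -> list:
    """Greedy grouping: join the first group whose first member is in range."""
    anchors, indices = [], []
    for point in points:
        for i, anchor in enumerate(anchors):
            if miles_between(point, anchor) <= DISTANCE_THRESHOLD_MILES:
                indices.append(i + 1)
                break
        else:
            # No existing group is close enough, so start a new one.
            anchors.append(point)
            indices.append(len(anchors))
    return indices
```

With the coordinates from the output examples, `assign_location_index([(42.3601, -71.0589), (42.3611, -71.0599), (40.7128, -74.0060)])` puts the two Boston points in group 1 and the New York point in group 2. The haversine result differs slightly from a true geodesic distance, which is unlikely to matter at a 50-mile threshold.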
The input CSV file should contain the following columns:
- `Company Name`: Name of the company.
- `first3_addresses`: Address details.

| Company Name | first3_addresses |
|---|---|
| Example Co. | 123 Main St, Boston, MA |
| Sample LLC | 124 Main Rd, Boston, MA |
| Another LLC | 789 Broadway, New York, NY |
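Checking for the two required columns before any geocoding starts avoids a failure partway through a long run. The following is a minimal sketch using the standard `csv` module; `read_companies` and `REQUIRED_COLUMNS` are hypothetical names, and the script itself may load the file differently.

```python
import csv
import io

REQUIRED_COLUMNS = {"Company Name", "first3_addresses"}


def read_companies(handle) -> list:
    """Read rows from an open CSV file, failing early on missing columns."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


# In-memory sample mirroring the table above; addresses contain commas,
# so they must be quoted in the CSV file.
sample = io.StringIO(
    "Company Name,first3_addresses\n"
    'Example Co.,"123 Main St, Boston, MA"\n'
    'Sample LLC,"124 Main Rd, Boston, MA"\n'
)
rows = read_companies(sample)
```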
- Processed Data:
  - A CSV file containing the processed data with additional columns:
    - `Latitude`: Geographical latitude of the address.
    - `Longitude`: Geographical longitude of the address.
    - `Location Index`: Group index for similar companies.
- Low Similarity Data:
  - A CSV file listing records with overall similarity scores below a specified threshold (default: 68).

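Separating the low-similarity records comes down to a threshold comparison on each record's overall score. In this sketch the `overall_similarity` field name and the `split_by_similarity` helper are assumptions; only the default of 68 comes from the description above.

```python
LOW_SIMILARITY_THRESHOLD = 68  # default stated above


def split_by_similarity(records, threshold=LOW_SIMILARITY_THRESHOLD):
    """Return (kept, low), where low holds records scoring below threshold."""
    kept = [r for r in records if r["overall_similarity"] >= threshold]
    low = [r for r in records if r["overall_similarity"] < threshold]
    return kept, low


records = [
    {"Company Name": "Example Co.", "overall_similarity": 91},
    {"Company Name": "Sample LLC", "overall_similarity": 68},
    {"Company Name": "Another LLC", "overall_similarity": 40},
]
kept, low = split_by_similarity(records)
```

A score exactly equal to the threshold is kept here, since the description only flags scores below it.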
- Approach (Variation 1):
  - Uses `multiprocessing.Pool` to perform parallel geocoding of addresses.
  - Suitable for environments where CPU-intensive operations can be distributed across multiple cores.
- Output Example:

| Company Name | first3_addresses | Latitude | Longitude | Location Index |
|---|---|---|---|---|
| Example Co. | 123 Main St, Boston, MA | 42.3601 | -71.0589 | 1 |
| Sample LLC | 124 Main Rd, Boston, MA | 42.3611 | -71.0599 | 1 |
| Another LLC | 789 Broadway, New York, NY | 40.7128 | -74.0060 | 2 |

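The `multiprocessing.Pool` approach can be illustrated as follows. So that the sketch runs offline, a small lookup table stands in for the `Nominatim` call; `geocode_one`, `geocode_parallel`, and `KNOWN_COORDINATES` are hypothetical names, not the script's own.

```python
from multiprocessing import Pool

# Stand-in lookup table so the sketch runs offline; the script would query
# the Nominatim geocoder here instead.
KNOWN_COORDINATES = {
    "123 MAIN ST, BOSTON, MA": (42.3601, -71.0589),
    "789 BROADWAY, NEW YORK, NY": (40.7128, -74.0060),
}


def geocode_one(address: str):
    """Return (lat, lon) for an address, or (None, None) when unresolved."""
    return KNOWN_COORDINATES.get(address, (None, None))


def geocode_parallel(addresses, workers: int = 2):
    """Map geocode_one over addresses in worker processes, keeping input order."""
    with Pool(processes=workers) as pool:
        return pool.map(geocode_one, addresses)


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this on import.
    print(geocode_parallel(list(KNOWN_COORDINATES) + ["UNKNOWN ADDRESS"]))
```

The worker function is defined at module level because `Pool` has to pickle it to send it to the worker processes. `pool.map` returns results in the same order as the input, so the coordinates can be written back to the rows by position.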
- Approach (Variation 2):
  - Implements asynchronous geocoding using `aiohttp` and `asyncio` for concurrent address resolution.
  - Efficient for large datasets where latency is critical.
- Output Example:

| Company Name | first3_addresses | Latitude | Longitude | Location Index |
|---|---|---|---|---|
| Example Co. | 123 Main St, Boston, MA | 42.3601 | -71.0589 | 1 |
| Sample LLC | 124 Main Rd, Boston, MA | 42.3611 | -71.0599 | 1 |
| Another LLC | 789 Broadway, New York, NY | 40.7128 | -74.0060 | 2 |

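The asynchronous approach can be outlined with `asyncio` alone. In this sketch a stub coroutine replaces the `aiohttp` request so the example runs without network access; `fetch_coordinates`, `geocode_all`, and the `max_concurrent` parameter are assumptions made for illustration.

```python
import asyncio

# Stand-in data so the sketch runs offline.
KNOWN_COORDINATES = {
    "123 MAIN ST, BOSTON, MA": (42.3601, -71.0589),
    "789 BROADWAY, NEW YORK, NY": (40.7128, -74.0060),
}


async def fetch_coordinates(address: str):
    """Stub for the HTTP request; the script would await an aiohttp call here."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return KNOWN_COORDINATES.get(address, (None, None))


async def geocode_all(addresses, max_concurrent: int = 5):
    """Resolve all addresses concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(address):
        async with semaphore:
            return await fetch_coordinates(address)

    # gather returns results in the same order as the input addresses.
    return await asyncio.gather(*(bounded(a) for a in addresses))


results = asyncio.run(
    geocode_all(["123 MAIN ST, BOSTON, MA", "UNKNOWN ADDRESS"])
)
```

The semaphore bounds how many requests are in flight at once, which is one way to keep concurrent geocoding within a service's rate limits.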
- Add print statements in `merge_companies` to observe how companies are grouped.
- For asynchronous geocoding, monitor address processing in real time using `tqdm`.
- Ensure geocoding requests do not exceed the rate limits of `Nominatim` by managing retries and delays.
- Adjust thresholds for name similarity, address similarity, and distance based on the dataset characteristics.
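
One way to manage retries and delays is a small wrapper that waits longer after each failed attempt. This is a sketch under assumptions: `call_with_retries` is a hypothetical helper, and the retry count and delays are placeholders to tune against the service's policy (the public Nominatim service asks for at most one request per second).

```python
import time


def call_with_retries(func, *args, retries=3, delay_seconds=1.0, sleep=time.sleep):
    """Call func(*args); on failure wait and retry, doubling the delay each time."""
    for attempt in range(retries):
        try:
            return func(*args)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(delay_seconds * 2 ** attempt)
```

The `sleep` parameter exists so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.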