entseeker is a command-line tool for Named Entity Recognition (NER) and web entity searches in text files. It uses spaCy's NLP capabilities for standard named entities and custom rules for web-related entities.
- Performs NER using spaCy models
- Identifies web-related entities (URLs, hostnames, IP addresses, etc.)
- Supports multiple spaCy models for different languages
- Allows searching for specific entity types or predefined bundles
- Option to output results to CSV
-
Ensure you have Python 3.6 or later installed.
-
Clone this repository:
git clone https://github.com/meefs/entseeker.git cd entseeker
-
Install the required dependency:
pip install spacy
-
Download a spaCy model (the script uses "en_core_web_sm" by default):
python -m spacy download en_core_web_sm
Run the script using:
python ents.py <input_file> [options]
--model
: Specify a spaCy model (default: en_core_web_sm)--entities
: Specify entity types to search for--types
: Use predefined bundles of entity types--csv
: Output results to a CSV file
en_core_web_sm
: English pipeline (small, 12 MB)en_core_web_md
: English pipeline (medium, 40 MB)en_core_web_lg
: English pipeline (large, 560 MB)en_core_web_trf
: English pipeline with transformer (extra large, 438 MB)xx_ent_wiki_sm
: Multi-language NER model (small, 12 MB)de_core_news_sm
: German pipeline (small, 14 MB)es_core_news_sm
: Spanish pipeline (small, 14 MB)fr_core_news_sm
: French pipeline (small, 14 MB)it_core_news_sm
: Italian pipeline (small, 14 MB)nl_core_news_sm
: Dutch pipeline (small, 13 MB)pt_core_news_sm
: Portuguese pipeline (small, 14 MB)
Type | Description | Example |
---|---|---|
PERSON | People, including fictional | John Smith, Harry Potter |
NORP | Nationalities or religious or political groups | American, Buddhist |
FAC | Buildings, airports, highways, bridges, etc. | Empire State Building, JFK Airport |
ORG | Companies, agencies, institutions, etc. | Microsoft, FBI, MIT |
GPE | Countries, cities, states | France, New York City, Texas |
LOC | Non-GPE locations, mountain ranges, bodies of water | Alps, Pacific Ocean |
PRODUCT | Objects, vehicles, foods, etc. (Not services) | iPhone, Boeing 747 |
EVENT | Named hurricanes, battles, wars, sports events, etc. | World War II, Super Bowl |
WORK_OF_ART | Titles of books, songs, etc. | "To Kill a Mockingbird" |
LAW | Named documents made into laws | U.S. Constitution |
LANGUAGE | Any named language | English, Spanish |
DATE | Absolute or relative dates or periods | July 4th, last week |
TIME | Times smaller than a day | 3:30 pm, midnight |
PERCENT | Percentage, including "%" | 50%, fifty percent |
MONEY | Monetary values, including unit | $100, 50 euros |
QUANTITY | Measurements, as of weight or distance | 10 km, 20 pounds |
ORDINAL | "first", "second", etc. | First, 2nd |
CARDINAL | Numerals that do not fall under another type | 500, ten |
URL | Web addresses | https://www.example.com |
HOSTNAME | Domain names | example.com |
IP_ADDRESS | IPv4 or IPv6 addresses | 192.168.1.1 |
PORT | Network port numbers | :8080 |
PROTOCOL | Network protocols | http://, ftp:// |
people
: PERSONorganizations
: ORG, NORPplaces
: GPE, LOC, FACthings
: PRODUCT, WORK_OF_ART, LAW, LANGUAGEevents
: EVENTdates
: DATE, TIMEnumbers
: PERCENT, MONEY, QUANTITY, ORDINAL, CARDINALweb
: URL, HOSTNAME, IP_ADDRESS, PORT, PROTOCOL
-
Search for all entities in a file:
python ents.py input.txt
-
Search for specific entity types:
python ents.py input.txt --entities PERSON ORG
-
Use predefined entity type bundles:
python ents.py input.txt --types people places web
-
Use a different spaCy model:
python ents.py input.txt --model en_core_web_lg
-
Output results to a CSV file:
python ents.py input.txt --csv output.csv
-
Combine multiple options:
python ents.py input.txt --model en_core_web_md --types people organizations web --csv output.csv
Contributions are welcome! Please feel free to submit a Pull Request.
This project is open source and available under the MIT License.