entseeker

entseeker is a command-line tool for Named Entity Recognition (NER) and web entity searches in text files. It uses spaCy's NLP capabilities for standard named entities and custom rules for web-related entities.

Features

Performs NER using spaCy models
Identifies web-related entities (URLs, hostnames, IP addresses, etc.)
Supports multiple spaCy models for different languages
Allows searching for specific entity types or predefined bundles
Option to output results to CSV

Installation

Ensure you have Python 3.6 or later installed.

Clone this repository:

git clone https://github.com/meefs/entseeker.git
cd entseeker

Install the required dependency:
```
pip install spacy
```
Download a spaCy model (the script uses "en_core_web_sm" by default):
```
python -m spacy download en_core_web_sm
```

Usage

Run the script using:

python ents.py <input_file> [options]

Options:

--model: Specify a spaCy model (default: en_core_web_sm)
--entities: Specify entity types to search for
--types: Use predefined bundles of entity types
--csv: Output results to a CSV file

Available Models:

en_core_web_sm: English pipeline (small, 12 MB)
en_core_web_md: English pipeline (medium, 40 MB)
en_core_web_lg: English pipeline (large, 560 MB)
en_core_web_trf: English pipeline with transformer (extra large, 438 MB)
xx_ent_wiki_sm: Multi-language NER model (small, 12 MB)
de_core_news_sm: German pipeline (small, 14 MB)
es_core_news_sm: Spanish pipeline (small, 14 MB)
fr_core_news_sm: French pipeline (small, 14 MB)
it_core_news_sm: Italian pipeline (small, 14 MB)
nl_core_news_sm: Dutch pipeline (small, 13 MB)
pt_core_news_sm: Portuguese pipeline (small, 14 MB)

Entity Types:

Type	Description	Example
PERSON	People, including fictional	John Smith, Harry Potter
NORP	Nationalities or religious or political groups	American, Buddhist
FAC	Buildings, airports, highways, bridges, etc.	Empire State Building, JFK Airport
ORG	Companies, agencies, institutions, etc.	Microsoft, FBI, MIT
GPE	Countries, cities, states	France, New York City, Texas
LOC	Non-GPE locations, mountain ranges, bodies of water	Alps, Pacific Ocean
PRODUCT	Objects, vehicles, foods, etc. (Not services)	iPhone, Boeing 747
EVENT	Named hurricanes, battles, wars, sports events, etc.	World War II, Super Bowl
WORK_OF_ART	Titles of books, songs, etc.	"To Kill a Mockingbird"
LAW	Named documents made into laws	U.S. Constitution
LANGUAGE	Any named language	English, Spanish
DATE	Absolute or relative dates or periods	July 4th, last week
TIME	Times smaller than a day	3:30 pm, midnight
PERCENT	Percentage, including "%"	50%, fifty percent
MONEY	Monetary values, including unit	$100, 50 euros
QUANTITY	Measurements, as of weight or distance	10 km, 20 pounds
ORDINAL	"first", "second", etc.	First, 2nd
CARDINAL	Numerals that do not fall under another type	500, ten
URL	Web addresses	https://www.example.com
HOSTNAME	Domain names	example.com
IP_ADDRESS	IPv4 or IPv6 addresses	192.168.1.1
PORT	Network port numbers	:8080
PROTOCOL	Network protocols	http://, ftp://

Predefined Type Bundles:

people: PERSON
organizations: ORG, NORP
places: GPE, LOC, FAC
things: PRODUCT, WORK_OF_ART, LAW, LANGUAGE
events: EVENT
dates: DATE, TIME
numbers: PERCENT, MONEY, QUANTITY, ORDINAL, CARDINAL
web: URL, HOSTNAME, IP_ADDRESS, PORT, PROTOCOL

Examples:

Search for all entities in a file:
```
python ents.py input.txt
```

Search for specific entity types:

python ents.py input.txt --entities PERSON ORG

Use predefined entity type bundles:

python ents.py input.txt --types people places web

Use a different spaCy model:

python ents.py input.txt --model en_core_web_lg

Output results to a CSV file:

python ents.py input.txt --csv output.csv

Combine multiple options:

python ents.py input.txt --model en_core_web_md --types people organizations web --csv output.csv

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is open source and available under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
LICENSE		LICENSE
README.md		README.md
ents.py		ents.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

entseeker

Features

Installation

Usage

Options:

Available Models:

Entity Types:

Predefined Type Bundles:

Examples:

Contributing

License

About

Uh oh!

Releases

Packages

Languages

License

meefs/entseeker

Folders and files

Latest commit

History

Repository files navigation

entseeker

Features

Installation

Usage

Options:

Available Models:

Entity Types:

Predefined Type Bundles:

Examples:

Contributing

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages