In [3]:
text = """The city of New York, located in the United States, is known for its iconic landmarks and diverse population. From the towering skyscrapers of Manhattan to the beautiful green spaces of Central Park, there is no shortage of things to see and do in this bustling metropolis. One of the most famous landmarks in New York is the Statue of Liberty, a colossal copper statue that stands proudly on Liberty Island in the New York Harbor. It was a gift from the people of France to the United States and has become a symbol of freedom and democracy. Thousands of tourists visit the statue each year, marveling at its grandeur and historical significance. Another must-visit location in New York is Times Square, often referred to as "The Crossroads of the World." Known for its dazzling billboards and vibrant atmosphere, Times Square is a hub of entertainment, shopping, and dining. It is particularly famous for its annual New Year's Eve celebration, where thousands of people gather to watch the iconic ball drop at midnight. New York is also home to several world-renowned museums. The Metropolitan Museum of Art, located on the eastern edge of Central Park, houses an extensive collection of art and artifacts from around the globe. Visitors can admire works by renowned artists such as Vincent van Gogh, Pablo Picasso, and Leonardo da Vinci. For those interested in natural history, the American Museum of Natural History is a treasure trove of knowledge. From dinosaur fossils to exhibits on space exploration, this museum offers a fascinating glimpse into the natural world and the history of our planet. Moving away from Manhattan, Brooklyn is another borough that attracts visitors with its unique charm. The Brooklyn Bridge, an iconic suspension bridge spanning the East River, connects Manhattan to Brooklyn. Walking across the bridge provides stunning views of the city skyline and the river below. 
Brooklyn is also known for its vibrant neighborhoods, such as Williamsburg and DUMBO (Down Under the Manhattan Bridge Overpass). These areas are renowned for their artistic communities, trendy boutiques, and lively nightlife. In addition to its cultural attractions, New York is a melting pot of different cuisines. From street food carts offering hot dogs and pretzels to upscale restaurants serving international delicacies, the city caters to every culinary taste. Sports fans can indulge in their passion by catching a game at one of New York's famous sports arenas. Yankee Stadium, home to the New York Yankees baseball team, is a legendary venue where fans can cheer on their favorite players. Madison Square Garden, located in Midtown Manhattan, hosts basketball and ice hockey games, as well as concerts and other live events. New York's transportation system, including its iconic yellow taxis and extensive subway network, makes it easy to navigate the city. The subway system alone covers a vast area and provides access to all corners of the city, allowing residents and visitors to explore with ease. In conclusion, New York is a city that offers a myriad of experiences for visitors. Whether you're interested in history, art, food, or simply immersing yourself in the vibrant energy of a global metropolis, there is something for everyone in the Big Apple."""

In [2]:
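import spacy
from spacy import displacy

# Load the small English pipeline (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Run the pipeline and highlight the recognized entities inline in the notebook
doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)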

In [3]:
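import nltk
from nltk import ne_chunk, pos_tag, word_tokenize

# One-time resource downloads (exact resource names can vary by NLTK version):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

# Tokenize, POS-tag, then chunk the tagged tokens into a named-entity tree
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
entities = ne_chunk(tagged)

# Every subtree except the root sentence node ('S') is one named entity
for subtree in entities.subtrees():
    if subtree.label() != 'S':
        print(' '.join(word for word, _ in subtree.leaves()), subtree.label())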

New York GPE
United States GPE
Manhattan GPE
Central Park ORGANIZATION
New York GPE
Liberty GPE
Liberty Island ORGANIZATION
New York Harbor GPE
France GPE
United States GPE
New York GPE
Times Square PERSON
Crossroads ORGANIZATION
Times Square PERSON
New Year ORGANIZATION
New York GPE
Metropolitan Museum ORGANIZATION
Art GPE
Central Park ORGANIZATION
Vincent ORGANIZATION
Gogh PERSON
Pablo Picasso PERSON
Leonardo PERSON
American Museum ORGANIZATION
Natural History ORGANIZATION
Manhattan GPE
Brooklyn GPE
Brooklyn Bridge ORGANIZATION
East River LOCATION
Manhattan GPE
Brooklyn GPE
Brooklyn PERSON
Williamsburg PERSON
DUMBO ORGANIZATION
Manhattan Bridge Overpass FACILITY
New York GPE
New York GPE
Yankee Stadium PERSON
New York GPE
Yankees ORGANIZATION
Madison Square Garden PERSON
Midtown Manhattan GPE
New York GPE
New York GPE
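The NLTK output above gives some names different labels on different mentions. A minimal sketch for spotting such conflicts, assuming the (text, label) pairs have been collected into a list; the `pairs` list below is a hand-copied subset of the output, and the variable names are illustrative only:

```python
from collections import defaultdict

# A few (entity text, label) pairs copied from the NLTK output above.
pairs = [
    ("New York", "GPE"),
    ("Central Park", "ORGANIZATION"),
    ("Times Square", "PERSON"),
    ("Brooklyn", "GPE"),
    ("Brooklyn", "PERSON"),
    ("Yankee Stadium", "PERSON"),
    ("Madison Square Garden", "PERSON"),
]

# Collect the set of labels assigned to each surface string.
labels_by_text = defaultdict(set)
for entity_text, label in pairs:
    labels_by_text[entity_text].add(label)

# Keep only the strings that received more than one label.
conflicting = {
    entity_text: sorted(labels)
    for entity_text, labels in labels_by_text.items()
    if len(labels) > 1
}
print(conflicting)  # {'Brooklyn': ['GPE', 'PERSON']}
```

This check only catches inconsistent labels. Consistently wrong ones, such as Times Square tagged as PERSON, still need a manual look.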


In [4]:
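from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# BERT model fine-tuned for NER (labels: PER, ORG, LOC, MISC)
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

# Use a distinct name so the spaCy `nlp` object defined earlier is not overwritten
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

# Each result is one word piece with its BIO tag, score, and character offsets
ner_results = ner_pipeline(text)
print(ner_results)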

[{'entity': 'B-LOC', 'score': 0.99955016, 'index': 4, 'word': 'New', 'start': 12, 'end': 15}, {'entity': 'I-LOC', 'score': 0.9993405, 'index': 5, 'word': 'York', 'start': 16, 'end': 20}, {'entity': 'B-LOC', 'score': 0.9995314, 'index': 10, 'word': 'United', 'start': 37, 'end': 43}, {'entity': 'I-LOC', 'score': 0.9992786, 'index': 11, 'word': 'States', 'start': 44, 'end': 50}, {'entity': 'B-LOC', 'score': 0.99960834, 'index': 30, 'word': 'Manhattan', 'start': 143, 'end': 152}, {'entity': 'B-LOC', 'score': 0.99775547, 'index': 37, 'word': 'Central', 'start': 186, 'end': 193}, {'entity': 'I-LOC', 'score': 0.9985898, 'index': 38, 'word': 'Park', 'start': 194, 'end': 198}, {'entity': 'B-LOC', 'score': 0.999579, 'index': 64, 'word': 'New', 'start': 310, 'end': 313}, {'entity': 'I-LOC', 'score': 0.99939036, 'index': 65, 'word': 'York', 'start': 314, 'end': 318}, {'entity': 'B-LOC', 'score': 0.88721657, 'index': 68, 'word': 'St', 'start': 326, 'end': 328}, {'entity': 'I-LOC', 'score': 0.949807
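The pipeline returns one entry per word piece, so "New York" arrives as a `B-LOC` token followed by an `I-LOC` token. A rough sketch of merging these into whole entity spans using the character offsets; `merge_bio_tokens` is a hypothetical helper rather than part of transformers, and `sample_text` is just the opening of the notebook's `text` so that the copied offsets still line up:

```python
# Opening of the notebook's `text`; the offsets below index into it.
sample_text = "The city of New York, located in the United States"

# First four predictions copied from the pipeline output above (scores omitted).
tokens = [
    {"entity": "B-LOC", "word": "New", "start": 12, "end": 15},
    {"entity": "I-LOC", "word": "York", "start": 16, "end": 20},
    {"entity": "B-LOC", "word": "United", "start": 37, "end": 43},
    {"entity": "I-LOC", "word": "States", "start": 44, "end": 50},
]


def merge_bio_tokens(tokens, source_text):
    """Merge B-/I- tagged tokens into (label, text, start, end) spans."""
    spans = []  # each entry is [label, start, end]
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        if prefix == "I" and spans and spans[-1][0] == label:
            # Continuation of the current entity: extend its end offset.
            spans[-1][2] = tok["end"]
        else:
            # A B- tag (or an I- tag with nothing to continue) opens a new span.
            spans.append([label, tok["start"], tok["end"]])
    # Slice the original text so word pieces are rejoined without special handling.
    return [(label, source_text[start:end], start, end) for label, start, end in spans]


for span in merge_bio_tokens(tokens, sample_text):
    print(span)
# ('LOC', 'New York', 12, 20)
# ('LOC', 'United States', 37, 50)
```

Because the span text is sliced from the source string, pieces such as `St`, `##at`, `##ue` would come back as a single word. The transformers pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs similar grouping; the helper above only makes the idea explicit.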

In [5]:
data = [{'entity': 'B-LOC', 'score': 0.99955016, 'index': 4, 'word': 'New', 'start': 12, 'end': 15}, {'entity': 'I-LOC', 'score': 0.9993405, 'index': 5, 'word': 'York', 'start': 16, 'end': 20}, {'entity': 'B-LOC', 'score': 0.9995314, 'index': 10, 'word': 'United', 'start': 37, 'end': 43}, {'entity': 'I-LOC', 'score': 0.9992786, 'index': 11, 'word': 'States', 'start': 44, 'end': 50}, {'entity': 'B-LOC', 'score': 0.99960834, 'index': 30, 'word': 'Manhattan', 'start': 143, 'end': 152}, {'entity': 'B-LOC', 'score': 0.99775547, 'index': 37, 'word': 'Central', 'start': 186, 'end': 193}, {'entity': 'I-LOC', 'score': 0.9985898, 'index': 38, 'word': 'Park', 'start': 194, 'end': 198}, {'entity': 'B-LOC', 'score': 0.999579, 'index': 64, 'word': 'New', 'start': 310, 'end': 313}, {'entity': 'I-LOC', 'score': 0.99939036, 'index': 65, 'word': 'York', 'start': 314, 'end': 318}, {'entity': 'B-LOC', 'score': 0.88721657, 'index': 68, 'word': 'St', 'start': 326, 'end': 328}, {'entity': 'I-LOC', 'score': 0.9498079, 'index': 69, 'word': '##at', 'start': 328, 'end': 330}, {'entity': 'I-LOC', 'score': 0.9367647, 'index': 70, 'word': '##ue', 'start': 330, 'end': 332}, {'entity': 'I-LOC', 'score': 0.9302876, 'index': 71, 'word': 'of', 'start': 333, 'end': 335}, {'entity': 'I-LOC', 'score': 0.8262548, 'index': 72, 'word': 'Liberty', 'start': 336, 'end': 343}, {'entity': 'B-LOC', 'score': 0.9981761, 'index': 84, 'word': 'Liberty', 'start': 393, 'end': 400}, {'entity': 'I-LOC', 'score': 0.9974763, 'index': 85, 'word': 'Island', 'start': 401, 'end': 407}, {'entity': 'B-LOC', 'score': 0.99934226, 'index': 88, 'word': 'New', 'start': 415, 'end': 418}, {'entity': 'I-LOC', 'score': 0.9991905, 'index': 89, 'word': 'York', 'start': 419, 'end': 423}, {'entity': 'I-LOC', 'score': 0.9991866, 'index': 90, 'word': 'Harbor', 'start': 424, 'end': 430}, {'entity': 'B-LOC', 'score': 0.99960464, 'index': 100, 'word': 'France', 'start': 465, 'end': 471}, {'entity': 'B-LOC', 'score': 0.9996244, 'index': 103, 
'word': 'United', 'start': 479, 'end': 485}, {'entity': 'I-LOC', 'score': 0.9993771, 'index': 104, 'word': 'States', 'start': 486, 'end': 492}, {'entity': 'B-LOC', 'score': 0.9994079, 'index': 141, 'word': 'New', 'start': 679, 'end': 682}, {'entity': 'I-LOC', 'score': 0.99932325, 'index': 142, 'word': 'York', 'start': 683, 'end': 687}, {'entity': 'B-LOC', 'score': 0.9970278, 'index': 144, 'word': 'Times', 'start': 691, 'end': 696}, {'entity': 'I-LOC', 'score': 0.9978264, 'index': 145, 'word': 'Square', 'start': 697, 'end': 703}, {'entity': 'B-LOC', 'score': 0.85796714, 'index': 152, 'word': 'The', 'start': 727, 'end': 730}, {'entity': 'I-LOC', 'score': 0.88495773, 'index': 153, 'word': 'Cross', 'start': 731, 'end': 736}, {'entity': 'I-LOC', 'score': 0.98921853, 'index': 154, 'word': '##roads', 'start': 736, 'end': 741}, {'entity': 'I-LOC', 'score': 0.9706815, 'index': 155, 'word': 'of', 'start': 742, 'end': 744}, {'entity': 'I-LOC', 'score': 0.94512326, 'index': 156, 'word': 'the', 'start': 745, 'end': 748}, {'entity': 'I-LOC', 'score': 0.9243376, 'index': 157, 'word': 'World', 'start': 749, 'end': 754}, {'entity': 'B-LOC', 'score': 0.99823666, 'index': 171, 'word': 'Times', 'start': 815, 'end': 820}, {'entity': 'I-LOC', 'score': 0.99850583, 'index': 172, 'word': 'Square', 'start': 821, 'end': 827}, {'entity': 'B-MISC', 'score': 0.9820154, 'index': 191, 'word': 'New', 'start': 918, 'end': 921}, {'entity': 'I-MISC', 'score': 0.6506672, 'index': 192, 'word': 'Year', 'start': 922, 'end': 926}, {'entity': 'I-MISC', 'score': 0.6719681, 'index': 193, 'word': "'", 'start': 926, 'end': 927}, {'entity': 'I-MISC', 'score': 0.9683382, 'index': 194, 'word': 's', 'start': 927, 'end': 928}, {'entity': 'I-MISC', 'score': 0.72657925, 'index': 195, 'word': 'Eve', 'start': 929, 'end': 932}, {'entity': 'B-LOC', 'score': 0.9988223, 'index': 212, 'word': 'New', 'start': 1022, 'end': 1025}, {'entity': 'I-LOC', 'score': 0.99901044, 'index': 213, 'word': 'York', 'start': 1026, 'end': 
1030}, {'entity': 'B-ORG', 'score': 0.5866418, 'index': 225, 'word': 'Metropolitan', 'start': 1083, 'end': 1095}, {'entity': 'I-ORG', 'score': 0.54306406, 'index': 226, 'word': 'Museum', 'start': 1096, 'end': 1102}, {'entity': 'I-ORG', 'score': 0.84081864, 'index': 227, 'word': 'of', 'start': 1103, 'end': 1105}, {'entity': 'I-ORG', 'score': 0.89406323, 'index': 228, 'word': 'Art', 'start': 1106, 'end': 1109}, {'entity': 'B-LOC', 'score': 0.9943334, 'index': 236, 'word': 'Central', 'start': 1142, 'end': 1149}, {'entity': 'I-LOC', 'score': 0.9971675, 'index': 237, 'word': 'Park', 'start': 1150, 'end': 1154}, {'entity': 'B-PER', 'score': 0.9990947, 'index': 261, 'word': 'Vincent', 'start': 1285, 'end': 1292}, {'entity': 'I-PER', 'score': 0.99840707, 'index': 262, 'word': 'van', 'start': 1293, 'end': 1296}, {'entity': 'I-PER', 'score': 0.94931996, 'index': 263, 'word': 'Go', 'start': 1297, 'end': 1299}, {'entity': 'I-PER', 'score': 0.46651295, 'index': 264, 'word': '##gh', 'start': 1299, 'end': 1301}, {'entity': 'B-PER', 'score': 0.99920195, 'index': 266, 'word': 'Pablo', 'start': 1303, 'end': 1308}, {'entity': 'I-PER', 'score': 0.99813545, 'index': 267, 'word': 'Picasso', 'start': 1309, 'end': 1316}, {'entity': 'B-PER', 'score': 0.9962966, 'index': 270, 'word': 'Leonardo', 'start': 1322, 'end': 1330}, {'entity': 'I-PER', 'score': 0.9761723, 'index': 271, 'word': 'da', 'start': 1331, 'end': 1333}, {'entity': 'I-PER', 'score': 0.9486597, 'index': 272, 'word': 'Vinci', 'start': 1334, 'end': 1339}, {'entity': 'B-ORG', 'score': 0.843593, 'index': 282, 'word': 'American', 'start': 1386, 'end': 1394}, {'entity': 'I-ORG', 'score': 0.8173209, 'index': 283, 'word': 'Museum', 'start': 1395, 'end': 1401}, {'entity': 'I-ORG', 'score': 0.922327, 'index': 284, 'word': 'of', 'start': 1402, 'end': 1404}, {'entity': 'I-ORG', 'score': 0.94044715, 'index': 285, 'word': 'Natural', 'start': 1405, 'end': 1412}, {'entity': 'I-ORG', 'score': 0.8508997, 'index': 286, 'word': 'History', 
'start': 1413, 'end': 1420}, {'entity': 'B-LOC', 'score': 0.99939, 'index': 324, 'word': 'Manhattan', 'start': 1623, 'end': 1632}, {'entity': 'B-LOC', 'score': 0.99857175, 'index': 326, 'word': 'Brooklyn', 'start': 1634, 'end': 1642}, {'entity': 'B-LOC', 'score': 0.97477573, 'index': 339, 'word': 'Brooklyn', 'start': 1712, 'end': 1720}, {'entity': 'I-LOC', 'score': 0.9914055, 'index': 340, 'word': 'Bridge', 'start': 1721, 'end': 1727}, {'entity': 'B-LOC', 'score': 0.9977081, 'index': 348, 'word': 'East', 'start': 1770, 'end': 1774}, {'entity': 'I-LOC', 'score': 0.998262, 'index': 349, 'word': 'River', 'start': 1775, 'end': 1780}, {'entity': 'B-LOC', 'score': 0.9995029, 'index': 352, 'word': 'Manhattan', 'start': 1791, 'end': 1800}, {'entity': 'B-LOC', 'score': 0.998953, 'index': 354, 'word': 'Brooklyn', 'start': 1804, 'end': 1812}, {'entity': 'B-LOC', 'score': 0.9987016, 'index': 373, 'word': 'Brooklyn', 'start': 1905, 'end': 1913}, {'entity': 'B-ORG', 'score': 0.48051494, 'index': 384, 'word': 'Williams', 'start': 1967, 'end': 1975}, {'entity': 'B-ORG', 'score': 0.62387514, 'index': 387, 'word': 'D', 'start': 1984, 'end': 1985}, {'entity': 'I-ORG', 'score': 0.6471387, 'index': 388, 'word': '##UM', 'start': 1985, 'end': 1987}, {'entity': 'B-LOC', 'score': 0.986362, 'index': 394, 'word': 'Manhattan', 'start': 2006, 'end': 2015}, {'entity': 'I-LOC', 'score': 0.7894009, 'index': 395, 'word': 'Bridge', 'start': 2016, 'end': 2022}, {'entity': 'B-LOC', 'score': 0.63534355, 'index': 426, 'word': 'New', 'start': 2172, 'end': 2175}, {'entity': 'I-LOC', 'score': 0.9522842, 'index': 427, 'word': 'York', 'start': 2176, 'end': 2180}, {'entity': 'B-LOC', 'score': 0.67674273, 'index': 485, 'word': 'New', 'start': 2443, 'end': 2446}, {'entity': 'I-LOC', 'score': 0.92480934, 'index': 486, 'word': 'York', 'start': 2447, 'end': 2451}, {'entity': 'B-LOC', 'score': 0.50123787, 'index': 494, 'word': 'Yankee', 'start': 2476, 'end': 2482}, {'entity': 'B-LOC', 'score': 0.625058, 'index': 
500, 'word': 'New', 'start': 2504, 'end': 2507}, {'entity': 'I-LOC', 'score': 0.92146415, 'index': 501, 'word': 'York', 'start': 2508, 'end': 2512}, {'entity': 'B-ORG', 'score': 0.41877195, 'index': 502, 'word': 'Yankees', 'start': 2513, 'end': 2520}]

# Keep only the tag, word piece, and character offsets of each prediction
formatted_data = [
    (item['entity'], item['word'], item['start'], item['end'])
    for item in data
]

# Print the formatted data
for entity_type, word, start, end in formatted_data:
    print(f"Entity: {entity_type}, Word: {word}, Start: {start}, End: {end}")

Entity: B-LOC, Word: New, Start: 12, End: 15
Entity: I-LOC, Word: York, Start: 16, End: 20
Entity: B-LOC, Word: United, Start: 37, End: 43
Entity: I-LOC, Word: States, Start: 44, End: 50
Entity: B-LOC, Word: Manhattan, Start: 143, End: 152
Entity: B-LOC, Word: Central, Start: 186, End: 193
Entity: I-LOC, Word: Park, Start: 194, End: 198
Entity: B-LOC, Word: New, Start: 310, End: 313
Entity: I-LOC, Word: York, Start: 314, End: 318
Entity: B-LOC, Word: St, Start: 326, End: 328
Entity: I-LOC, Word: ##at, Start: 328, End: 330
Entity: I-LOC, Word: ##ue, Start: 330, End: 332
Entity: I-LOC, Word: of, Start: 333, End: 335
Entity: I-LOC, Word: Liberty, Start: 336, End: 343
Entity: B-LOC, Word: Liberty, Start: 393, End: 400
Entity: I-LOC, Word: Island, Start: 401, End: 407
Entity: B-LOC, Word: New, Start: 415, End: 418
Entity: I-LOC, Word: York, Start: 419, End: 423
Entity: I-LOC, Word: Harbor, Start: 424, End: 430
Entity: B-LOC, Word: France, Start: 465, End: 471
Entity: B-LOC, Word: United, St