### Named Entity Recognition (Person) Using NLTK

This notebook uses the NLTK library to extract person names from a text file.

#### Open the text file and store its content in the variable `text`

In [1]:
with open("NE.txt") as file:
    text = file.read()

print(text[:1000])  # show at most the first 1000 characters of the file

Eisenstein was born to a middle-class family in Riga, Latvia (then part of the Russian Empire in the Governorate of Livonia), but his family moved frequently in his early years, as Eisenstein continued to do throughout his life. His father, Mikhail Osipovich Eisenstein, was born to a German Jewish father who had converted to Christianity, Osip Eisenstein, and a mother of Swedish descent. His mother, Julia Ivanovna Konetskaya, was from a Russian Orthodox family. According to other sources, both of his paternal grandparents were of Baltic German descent. His father was an architect and his mother was the daughter of a prosperous merchant. Julia left Riga the same year as the Russian Revolution of 1905, taking Sergei with her to St. Petersburg. Her son would return at times to see his father, who joined them around 1910. Divorce followed and Julia left the family to live in France. Eisenstein was raised as an Orthodox Christian, but became an atheist later on.


### Steps
- Import the `nltk` library.
- Split the text into sentences using the sentence segmenter `nltk.sent_tokenize`.
- Subdivide each sentence into words using the word tokenizer `nltk.word_tokenize`.
- Tag the words with part-of-speech tags using `nltk.pos_tag`. The named entity chunker in the next step takes these tagged words as its input.
- Chunk the tagged words using `nltk.ne_chunk`. Chunking groups consecutive tokens into larger units; noun phrase chunking and verb group chunking are other examples. Here the chunker groups the tokens that make up a named entity and returns a tree in which each such group is a labeled subtree.
- After chunking, a named-entity chunk is labeled `PERSON` if the chunker recognizes it as a person name. Those are the chunks we keep.
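To make the last two steps concrete, here is a minimal sketch of the kind of tree `nltk.ne_chunk` returns and how a `PERSON` chunk can be picked out of it. The tree is built by hand for illustration; the sentence, the part-of-speech tags, and the entity labels are assumptions, not actual chunker output. Plain tokens are `(word, tag)` tuples, while named entities are `Tree` objects, which is why the extraction has to tell the two apart.

```python
from nltk.tree import Tree

# Hand-built stand-in for what nltk.ne_chunk might return for one sentence.
# The tags and labels below are illustrative assumptions.
sentence_tree = Tree('S', [
    Tree('PERSON', [('Julia', 'NNP')]),
    ('left', 'VBD'),
    Tree('GPE', [('Riga', 'NNP')]),
    ('with', 'IN'),
    Tree('PERSON', [('Sergei', 'NNP')]),
])

names = []
for child in sentence_tree:
    # Named entities are subtrees; ordinary tokens are (word, tag) tuples
    if isinstance(child, Tree) and child.label() == 'PERSON':
        # leaves() yields the (word, tag) pairs inside the chunk
        names.append(' '.join(word for word, tag in child.leaves()))

print(names)
```

The same loop works on real chunker output, since only the source of the tree changes.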

In [2]:
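# NLTK needs its data packages for the tokenizer, tagger, and NE chunker.
# Fetch them once with nltk.download(...), e.g. 'punkt',
# 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words'
# (exact resource names vary by NLTK version).
import nltk
from pprint import pprint  # pretty print structured data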

# initialize an empty list to store person names
person_names = []

# tokenize sentences
for sentence in nltk.sent_tokenize(text):
    # tokenize words in each sentence
    word_tokens = nltk.word_tokenize(sentence)
    # words tagged with part-of-speech tags
    tagged_words = nltk.pos_tag(word_tokens)
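    # chunk the tagged words into named entities
    for chunk in nltk.ne_chunk(tagged_words):
        # named-entity chunks are subtrees with a label; plain tokens are tuples
        if hasattr(chunk, 'label') and chunk.label() == 'PERSON':
            # join the words of the chunk's (word, tag) leaves into one name
            name = ' '.join(word for word, tag in chunk.leaves())
            person_names.append(name)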

pprint(person_names, indent=2)

[ 'Eisenstein',
  'Eisenstein',
  'Mikhail Osipovich Eisenstein',
  'Osip Eisenstein',
  'Julia Ivanovna Konetskaya',
  'Julia',
  'Sergei',
  'Divorce',
  'Julia',
  'Eisenstein']


### Remove duplicates while preserving the order in which they appear in the text

In [3]:
from collections import OrderedDict
list(OrderedDict.fromkeys(person_names))

['Eisenstein',
 'Mikhail Osipovich Eisenstein',
 'Osip Eisenstein',
 'Julia Ivanovna Konetskaya',
 'Julia',
 'Sergei',
 'Divorce']
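
On Python 3.7 and later, a plain `dict` also preserves insertion order, so the same deduplication can be sketched without `OrderedDict`. The list below simply repeats the names printed above so that the snippet runs on its own.

```python
# Names as printed by the extraction step above, duplicates included
person_names = [
    'Eisenstein',
    'Eisenstein',
    'Mikhail Osipovich Eisenstein',
    'Osip Eisenstein',
    'Julia Ivanovna Konetskaya',
    'Julia',
    'Sergei',
    'Divorce',
    'Julia',
    'Eisenstein',
]

# dict keys are unique and keep first-insertion order (Python 3.7+)
unique_names = list(dict.fromkeys(person_names))
print(unique_names)
```

Deduplication only removes exact repeats. `'Divorce'` is a false positive from the chunker and stays in the list, and `'Julia'` and `'Julia Ivanovna Konetskaya'` remain separate entries even though they refer to the same person.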