#### Name : Syed Khalid Ahmed
#### Matriculation number : 276970

# Named Entity Recognition (NER) using Python and nltk-trainer

In this task, I used nltk-trainer, an open-source tool for Natural Language Processing, and its scripts to train the models with a Naive Bayes classifier. I tried to implement the algorithm myself on the available data, but I was not able to model the whole process.

## POS tagger function

I have implemented a POS tagger to tag the words that are passed to the NER algorithm. Following the instructions on the NLTK website, I created a combined (backoff) tagger.
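
The backoff idea behind a combined tagger can be illustrated without NLTK. The following is a minimal sketch in plain Python, not NLTK's implementation: the helper names `train_tables` and `tag_with_backoff` and the toy training data are assumptions made for illustration. A word is tagged from the bigram table if its (previous tag, word) context was seen in training, otherwise from the unigram table, and otherwise with the default tag `NN`.

```python
from collections import Counter, defaultdict


def train_tables(tagged_sents):
    """Build unigram and bigram lookup tables from tagged sentences."""
    uni_counts = defaultdict(Counter)   # word -> tag counts
    bi_counts = defaultdict(Counter)    # (previous tag, word) -> tag counts
    for sent in tagged_sents:
        prev_tag = None                 # None marks the start of a sentence
        for word, tag in sent:
            uni_counts[word][tag] += 1
            bi_counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    # Keep only the most frequent tag for every context
    unigram = {w: c.most_common(1)[0][0] for w, c in uni_counts.items()}
    bigram = {k: c.most_common(1)[0][0] for k, c in bi_counts.items()}
    return unigram, bigram


def tag_with_backoff(words, unigram, bigram, default='NN'):
    """Tag words: bigram table first, then unigram table, then the default tag."""
    tagged = []
    prev_tag = None
    for word in words:
        tag = bigram.get((prev_tag, word))
        if tag is None:                 # unseen bigram context: back off
            tag = unigram.get(word, default)
        tagged.append((word, tag))
        prev_tag = tag
    return tagged


# Hypothetical toy training data, only for illustration
toy_train = [
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
]
unigram, bigram = train_tables(toy_train)
result = tag_with_backoff(['a', 'dog', 'sleeps'], unigram, bigram)
print(result)
```

In this toy run, `a` is unknown and receives the default tag, `dog` is resolved by the unigram table because the context (`NN`, `dog`) was never seen, and `sleeps` is resolved by the bigram table.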

The file POS-tagger.py contains the following code:

In [1]:
#!/usr/bin/env python

import nltk
from pickle import dump


## Here I have implemented a combined tagger to increase the accuracy of the tagger
def BackoffTaggers(train_data, test_data):

    # Each tagger falls back to the next one when it cannot tag a word:
    # bigram -> unigram -> default ('NN')
    Default_tagger = nltk.DefaultTagger('NN')
    Unigram_tagger = nltk.UnigramTagger(train_data, backoff=Default_tagger)
    Bigram_tagger = nltk.BigramTagger(train_data, backoff=Unigram_tagger)

    print("\nAccuracy on the test data comes out to be : ", end='')
    print(Bigram_tagger.evaluate(test_data))  # Evaluating the accuracy on test data

    ## Dump the tagger in a file to be used later
    with open('tagged_data.pkl', 'wb') as output:
        dump(Bigram_tagger, output, -1)

    print("\nDumped the data in a file for later use")


## Function to read the CONLL corpus
def read_conll(path):
    result = []
    sent = []
    with open(path) as file:
        for line in file:
            line = line.strip()
            if not line:
                # A blank line marks the end of a sentence
                if sent:
                    result.append(sent)
                sent = []
                continue
            (word, pos, tag) = line.split(' ')
            sent.append((word, pos))     # storing only word and POS to train the tagger
    if sent:
        result.append(sent)              # keep the last sentence if there is no trailing blank line
    return result


if __name__ == "__main__":

    conll_train = read_conll('train.txt')   # Read the CONLL training text
    conll_test = read_conll('test.txt')     # Read the CONLL testing text

    BackoffTaggers(conll_train, conll_test)  # Call the function


Accuracy on the test data comes out to be : 0.9174493952761889

Dumped the data in a file for later use
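
The accuracy reported above is a token-level score: the fraction of test tokens whose predicted tag equals the gold tag. A minimal sketch of that computation follows. The helper name `token_accuracy` and the toy data are assumptions for illustration, and this is not NLTK's own code.

```python
def token_accuracy(gold_sents, predicted_sents):
    """Return the fraction of tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


# Hypothetical gold and predicted taggings, only for illustration
gold = [
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('cats', 'NNS'), ('sleep', 'VBP')],
]
pred = [
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'NN')],
    [('cats', 'NN'), ('sleep', 'VBP')],
]
score = token_accuracy(gold, pred)
print(score)  # 3 of the 5 tokens carry the correct tag
```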


## Named Entity Recognition 

Here, I used the script provided by the nltk-trainer library to train a Naive Bayes classifier on the training data of the CONLL2000 corpus. A screenshot of the training run is attached.

![title](Training.png)
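
The CONLL2000 corpus marks chunk boundaries in its third column with IOB tags: `B-NP` begins a noun phrase, `I-NP` continues it, and `O` is outside any chunk. The sketch below groups IOB-tagged tokens into labelled chunks in plain Python. The function name `iob_to_chunks` and the hand-written sample sentence are assumptions for illustration and are not part of nltk-trainer.

```python
def iob_to_chunks(tokens):
    """Group (word, pos, iob) triples into (label, [(word, pos), ...]) chunks."""
    chunks = []
    current_label, current_words = None, []
    for word, pos, iob in tokens:
        # A chunk starts at a B- tag, or at an I- tag whose label changes
        starts_new = iob.startswith('B-') or (
            iob.startswith('I-') and iob[2:] != current_label)
        if iob == 'O' or starts_new:
            if current_words:                  # close the chunk in progress
                chunks.append((current_label, current_words))
            current_label, current_words = None, []
        if iob == 'O':
            continue                           # tokens outside chunks are skipped
        if starts_new:
            current_label = iob[2:]
        current_words.append((word, pos))
    if current_words:                          # close the final chunk
        chunks.append((current_label, current_words))
    return chunks


# Hypothetical IOB-tagged sentence, written by hand for illustration
sample = [
    ('Germany', 'NNP', 'B-NP'),
    ('is', 'VBZ', 'B-VP'),
    ('a', 'DT', 'B-NP'),
    ('very', 'RB', 'I-NP'),
    ('beautiful', 'JJ', 'I-NP'),
    ('country', 'NN', 'I-NP'),
    ('.', '.', 'O'),
]
chunks = iob_to_chunks(sample)
for label, words in chunks:
    print(label, words)
```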

## Main Program

The objects generated by the POS tagger function and the Named Entity Recognition step can now be combined to create an NER program.

The file Main_Program.py contains the following code.

In [4]:
#!/usr/bin/env python

import pickle
import nltk

if __name__ == "__main__":

    ## Open the file containing the POS tagger and load the tagger object using pickle
    with open('tagged_data.pkl', 'rb') as file:
        POS_tagger = pickle.load(file)

    ## Load the saved NER classifier object
    ## This file was generated by the nltk-trainer python script which
    ## generates a Naive-Bayes trained classifier
    NER_classifier = nltk.data.load('conll2000_NaiveBayes.pickle')

    ## Split the input sentence into a list of words
    words = "Germany is a very beautiful country".split()

    ## Tag the words using the previously made tagger
    tagged_data = POS_tagger.tag(words)

    ## Find the named entities using the NER classifier and the tagger
    output = NER_classifier.parse(tagged_data)

    ## Print the output
    print(output)


(S
  (NP Germany/NNP)
  (VP is/VBZ)
  (NP a/DT very/RB beautiful/JJ country/NN))


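As seen in the output, the program separates the named entity from the rest of the sentence. "Germany" is tagged as a proper noun (NNP) and placed in its own noun phrase (NP), "is" forms a verb phrase (VP), and "a very beautiful country" is separated as a second noun phrase. The labels NP and VP are the chunk types of the CONLL2000 corpus.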
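
To pull the candidate named entities out of the result programmatically, one option is to keep only the noun phrases that contain a proper noun. The sketch below rebuilds the tree printed above from its string form, so it runs without the pickled models. The helper name `proper_noun_phrases` and the rule "an NP that contains an NNP or NNPS tag" are assumptions for illustration, not part of the original program.

```python
from nltk import Tree
from nltk.tag import str2tuple

# Rebuild the printed output; str2tuple turns "Germany/NNP" into ('Germany', 'NNP')
tree = Tree.fromstring(
    "(S (NP Germany/NNP) (VP is/VBZ) (NP a/DT very/RB beautiful/JJ country/NN))",
    read_leaf=str2tuple,
)


def proper_noun_phrases(tree):
    """Return the text of every NP chunk that contains a proper noun."""
    phrases = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        leaves = subtree.leaves()              # list of (word, tag) pairs
        if any(tag in ('NNP', 'NNPS') for _, tag in leaves):
            phrases.append(' '.join(word for word, _ in leaves))
    return phrases


entities = proper_noun_phrases(tree)
print(entities)
```

With the real program, the same function could be applied directly to `output`, since the chunker also returns a tree whose leaves are (word, tag) pairs.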