
# **Natural Language Processing**
by **Sohail Khan**

# **Creating Urdu NER**




**Reading Data from Google Drive**



In [1]:
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Mounted at /content/drive


**Note about the data:**
```
This dataset is a subset of the data available at
http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
(Workshop on NER for South and South East Asian Languages, IJCNLP 2008, Hyderabad, India).

The data has been reformatted and enriched. If you use it, please acknowledge the original source in your paper or report.
```



In [14]:
import csv
import pandas as pd

with open('/content/drive/My Drive/NER-Dataset/conll-ner.csv', encoding = 'utf-8') as csvfile:
    data = list(csv.reader(csvfile, delimiter=','))
df=pd.DataFrame(data,columns=["word-no", "word", "lemma/root-word", "universal-part-of-speech", "other-part-of-speech","empty", "IBO-tag", "Entity-type"])
df

Unnamed: 0,word-no,word,lemma/root-word,universal-part-of-speech,other-part-of-speech,empty,IBO-tag,Entity-type
0,1,زیرتربیت,زيرتربیت,NOUN,NNC,,O,
1,2,پولیس,پولیس,NOUN,NNC,,O,
2,3,اہلکاروں,اہلكارہ,NOUN,NN,,O,
3,4,کی,کا,ADP,PSP,,O,
4,5,تربیت,تربیت,NOUN,NN,,O,
...,...,...,...,...,...,...,...,...
14949,40,جا,جا,AUX,VAUX,,O,
14950,41,سکے,سک,AUX,VAUX,,O,
14951,1,سماجی,سماجی,ADJ,JJ,,NETO,B
14952,2,ترقیاتی,ترقیاتی,ADJ,JJ,,NETO,I




```
The columns of the above output are:
word-no, word, lemma/root-word, universal-part-of-speech, other-part-of-speech, empty, IBO-tag, Entity-type

Note that in this file the IBO-tag column holds the entity type (e.g. NETO, or O for
words outside any entity), while the Entity-type column holds the B/I subtag.

The first word of a named entity is tagged B(eginning),
the remaining words of the named entity are tagged I(nside),
and words that do not belong to a named entity are tagged O(utside).

For example
Bill ... Person B
Gates ... Person I
founded ... O
Microsoft ... Organization B

```
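The tagging scheme described above can be illustrated with a minimal sketch that groups tagged words back into entities. The function name `extract_entities` and the three-field row layout (token, entity type, B/I subtag) are assumptions made for this illustration; they are not part of the dataset or the notebook.

```python
def extract_entities(rows):
    """Group (token, entity_type, subtag) rows into (entity_type, text) spans."""
    entities = []
    current = None  # the entity currently being extended, if any
    for token, entity_type, subtag in rows:
        if subtag == "B":
            # A B subtag always opens a new entity.
            current = (entity_type, [token])
            entities.append(current)
        elif subtag == "I" and current is not None and current[0] == entity_type:
            # An I subtag continues the open entity of the same type.
            current[1].append(token)
        else:
            # Anything else (e.g. an O word) closes the open entity.
            current = None
    return [(etype, " ".join(tokens)) for etype, tokens in entities]


# Hypothetical rows mirroring the Bill Gates example above.
rows = [
    ("Bill", "Person", "B"),
    ("Gates", "Person", "I"),
    ("founded", "O", ""),
    ("Microsoft", "Organization", "B"),
]
print(extract_entities(rows))
# [('Person', 'Bill Gates'), ('Organization', 'Microsoft')]
```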




**Extracting features**

In [15]:
def extract_features(words, i):
    wid = words[i][0]

    token = words[i][1]
    upos = words[i][3]
    xpos = words[i][4]
    
    prev_token = ""
    prev_upos = ""
    prev_xpos = "" 
    
    next_token = ""
    next_upos = ""
    next_xpos = ""
    
    # A word-no of 1 marks the start of a sentence, so there is no previous word.
    if i > 0 and int(wid) != 1:
        prev_token = words[i-1][1]
        prev_upos = words[i-1][3]
        prev_xpos = words[i-1][4]
    # The next row belongs to the same sentence only if its word-no is larger.
    if i < len(words)-1:
        if int(wid) < int(words[i+1][0]):
            next_token = words[i+1][1]
            next_upos = words[i+1][3]
            next_xpos = words[i+1][4]

    # A token counts as a number if float() can parse it (this includes "0").
    is_number = False
    try:
        float(token)
        is_number = True
    except ValueError:
        pass

    features_dict = {"token": token
         #  , "upos": upos
         #   , "xpos": xpos
           , "prev_token": prev_token
        #    , "prev_upos": prev_upos
        #    , "prev_xpos": prev_xpos
            , "next_token": next_token
        #   , "next_upos": next_upos
        #    , "next_xpos": next_xpos
        , "is_number": is_number}
    return features_dict

print(data[3:6])
print(extract_features(data, 4))


[['4', 'کی', 'کا', 'ADP', 'PSP', '', 'O', ''], ['5', 'تربیت', 'تربیت', 'NOUN', 'NN', '', 'O', ''], ['6', 'میں', 'میں', 'ADP', 'PSP', '', 'O', '']]
{'token': 'تربیت', 'prev_token': 'کی', 'next_token': 'میں', 'is_number': False}


**Building a feature vector for each word**

In [None]:
X_features = []
Y = []


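for i in range(len(data)):
    try:
        features = extract_features(data, i)
        label = data[i][6]
    except (ValueError, IndexError):
        # Skip rows that cannot be parsed, such as blank or malformed rows.
        continue
    # Append both together so X_features and Y always stay aligned.
    X_features.append(features)
    Y.append(label)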

print(len(X_features),":",len(Y))

14795 : 14795


In [None]:
print(X_features[4:6])
print(Y[4:6])


[{'token': 'تربیت', 'prev_token': 'کی', 'next_token': 'میں', 'is_number': False}, {'token': 'میں', 'prev_token': 'تربیت', 'next_token': 'خامیاں', 'is_number': False}]
['O', 'O']







Most of the features are currently strings. We convert them into numeric vectors, with one column for each distinct feature=value string.

In [None]:
from sklearn.feature_extraction import DictVectorizer
vectoriser = DictVectorizer(sparse=True)
X = vectoriser.fit_transform(X_features)

print("Shape of the matrix: ", X.get_shape())
print(X[3])


Shape of the matrix:  (14795, 8572)
  (0, 0)	0.0
  (0, 743)	1.0
  (0, 3309)	1.0
  (0, 8408)	1.0
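To make the conversion above concrete, here is a small sketch of how `DictVectorizer` encodes feature dictionaries. The toy tokens and the names `toy_features`, `toy_vectoriser` and `toy_X` are made up for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction import DictVectorizer

# Three hypothetical feature dictionaries in the same style as extract_features.
toy_features = [
    {"token": "a", "prev_token": "", "is_number": False},
    {"token": "b", "prev_token": "a", "is_number": False},
    {"token": "7", "prev_token": "b", "is_number": True},
]

# sparse=False returns a dense array, which is easier to read for a toy example.
toy_vectoriser = DictVectorizer(sparse=False)
toy_X = toy_vectoriser.fit_transform(toy_features)

# String features become one column per "name=value" pair;
# the boolean is_number stays a single numeric column.
print(toy_vectoriser.feature_names_)
print(toy_X)
```

Each row contains a 1.0 in the columns matching its string values, which is why the real matrix above has only a handful of stored entries per word.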


# Loading the Word2Vec model
import gensim
model = gensim.models.Word2Vec.load("/content/drive/My Drive/CLT20/urdu-w2vec")

# Adding word-embedding features
import numpy

lm = []
Yf = []

e = 0  # number of tokens without an embedding
for i in range(X.shape[0]):  # len() is not defined for a sparse matrix
    try:
        # The word vectors live in model.wv; unknown tokens raise KeyError
        m = model.wv[X_features[i]['token']]
    except KeyError:
        #print(X_features[i]['token'], Y[i])
        e = e + 1
        continue
    # Densify the sparse row before appending the embedding to it
    lm.append(numpy.append(X[i].toarray().ravel(), m))
    Yf.append(Y[i])

print(e)

X = numpy.array(lm)
Y = Yf


print("number of features: ", len(X[3]))
print(X[3])
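Turning every sparse row into a dense array can use a lot of memory when the one-hot matrix has thousands of columns. As a possible alternative, the sketch below joins a sparse one-hot matrix and a dense embedding matrix with `scipy.sparse.hstack`, so the result stays sparse. The toy matrices `one_hot` and `embeddings` are invented for illustration; in the notebook they would correspond to `X` and the stacked word vectors, with the same row order.

```python
import numpy as np
from scipy import sparse

# Hypothetical one-hot features for two words (three columns).
one_hot = sparse.csr_matrix(np.array([[1.0, 0.0, 0.0],
                                      [0.0, 1.0, 0.0]]))

# Hypothetical 2-dimensional embeddings for the same two words.
embeddings = np.array([[0.1, 0.2],
                       [0.3, 0.4]])

# Stack column-wise; rows must be in the same order in both matrices.
combined = sparse.hstack([one_hot, sparse.csr_matrix(embeddings)], format="csr")

print(combined.shape)      # (2, 5)
print(combined.toarray())
```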

**Creating the training set and training the classifier**

In [None]:
from sklearn import model_selection
from sklearn import svm            # import support vector machine

cl = svm.LinearSVC()
validation_size = 0.20
seed = 7



X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, stratify = Y, test_size=validation_size, random_state=seed)
cl.fit(X_train, Y_train)


LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [None]:
print(cl.predict(X_test[4:7]))

['O' 'O' 'O']


In [None]:
print(Y_test[4:7])

['O', 'O', 'O']


In [None]:
from sklearn import metrics
# Predicted labels for the held-out test set
Y_predict = cl.predict(X_test)
print(metrics.classification_report(Y_test, Y_predict))
print(metrics.confusion_matrix(Y_test, Y_predict))


              precision    recall  f1-score   support

         NEA       0.87      0.72      0.79        18
         NED       0.57      0.43      0.49        28
         NEL       0.96      0.89      0.92       167
         NEM       0.85      0.69      0.76        67
         NEN       0.89      0.90      0.89        69
         NEO       0.94      0.60      0.73        25
         NEP       0.94      0.75      0.83       128
        NETE       1.00      0.80      0.89        10
        NETI       0.95      0.88      0.91        42
        NETO       0.50      1.00      0.67         1
        NETP       1.00      0.67      0.80         6
           O       0.96      0.99      0.98      2398

    accuracy                           0.95      2959
   macro avg       0.87      0.78      0.81      2959
weighted avg       0.95      0.95      0.95      2959

[[  13    0    0    0    0    0    0    0    0    0    0    5]
 [   0   12    2    0    0    0    1    0    0    0    0   13]
 [   0 
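As a quick check on how the report and the confusion matrix relate, the sketch below recomputes the NEA row. The counts 13 and 18 come from the first row of the confusion matrix and the support column above, and 0.87 is the reported precision. The helper name `f1_from` is an assumption made for this example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


nea_correct = 13   # NEA words predicted as NEA (diagonal of the first matrix row)
nea_support = 18   # all NEA words in the test set

# Recall is the share of true NEA words that the classifier found.
nea_recall = nea_correct / nea_support
nea_f1 = f1_from(0.87, nea_recall)

print(round(nea_recall, 2), round(nea_f1, 2))
# 0.72 0.79
```

The remaining 5 NEA words in that row were predicted as O, which is the most common kind of error in the rows shown.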


**Exercise 1:**  Enable the commented-out POS-tag features (upos, xpos) in extract_features, then train and evaluate the model.

**Exercise 2:**  Include the I and B subtags in the target tag, then run the experiment(s) again. (Hint: each element of Y would be data[i][6] +  "-" + data[i][7])

**(further) Exercises:** 
(a) Enable the Word Embedding Code, then train and evaluate the model.

(b) Create other features, then train and evaluate the model,
e.g. length of token, 3-letter suffix, gazetteer lists, etc.
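For exercise (b), one possible starting point is sketched below. The function name `extra_features`, the feature keys and the tiny `GAZETTEER` set are all hypothetical. A real gazetteer would be a list of known Urdu names loaded from a file.

```python
# Hypothetical gazetteer: a set of known entity names.
GAZETTEER = {"Hyderabad", "Microsoft"}


def extra_features(token, gazetteer=GAZETTEER):
    """Return additional features that can be merged into features_dict."""
    return {
        "token_length": len(token),          # numeric feature
        "suffix3": token[-3:],               # last three characters (or fewer)
        "in_gazetteer": token in gazetteer,  # True if the token is a known name
    }


print(extra_features("Hyderabad"))
# {'token_length': 9, 'suffix3': 'bad', 'in_gazetteer': True}
```

Inside extract_features, these could be merged with `features_dict.update(extra_features(token))` before the dictionary is returned.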


**K-fold Cross Validation**

In [None]:
from sklearn.model_selection import cross_val_score
clf = svm.LinearSVC()

scores = cross_val_score(clf, X, Y, cv=5)
print(scores)

[0.88746198 0.92936803 0.93511321 0.92666441 0.92700237]
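The fold scores above range from about 0.89 to 0.94, so it can help to report their mean and spread. With an integer `cv` and a classifier, scikit-learn uses stratified folds without shuffling. The sketch below shows a shuffled variant on synthetic data from `make_classification`. The toy data and the names `X_toy`, `y_toy`, `folds` and `toy_scores` are assumptions, and whether shuffling is appropriate for sentence-ordered data is a judgment call.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic 3-class data standing in for the NER feature matrix.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=7
)

# Shuffled, stratified 5-fold split with a fixed seed for reproducibility.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
toy_scores = cross_val_score(LinearSVC(), X_toy, y_toy, cv=folds)

print(toy_scores)
print("mean accuracy: %.3f (+/- %.3f)" % (toy_scores.mean(), toy_scores.std()))
```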
