**Instructions for POS Tagging Using RNNs with an Arabic Dataset**

**Dataset:**
The dataset provided is named "Assignment 2 - Arabic POS.conllu". It contains labeled data for Arabic text with Part-of-Speech (POS) tags in CoNLL-U format.

**Objective:**
Your objective is to perform Part-of-Speech (POS) tagging on Arabic text using Recurrent Neural Networks (RNNs). Specifically, you will predict Universal POS (UPOS) tags, the standardized, language-independent tag set defined by Universal Dependencies (17 coarse-grained tags such as NOUN, VERB, and ADJ).

**Evaluation metric:**
Accuracy

**Instructions:**
1. **Data Preprocessing:**
   - Load the provided dataset "Assignment 2 - Arabic POS.conllu". You can use the pyconll library.
   - Preprocess the data as necessary, including tokenization.

2. **Model Building:**
   - Design an RNN-based model architecture suitable for POS tagging. You may consider using recurrent layers such as LSTM or GRU.
   - Define the input and output layers of the model. The input layer should accept sequences of tokens, and the output layer should produce the predicted UPOS tags for each token.

3. **Training:**

4. **Evaluation:**

**Additional Notes:**
- Make sure to document your code thoroughly and provide clear explanations for each step.
- Feel free to explore different RNN architectures, hyperparameters, and optimization techniques to improve the model's accuracy.

### Import used libraries

In [11]:
import pyconll
import numpy as np
from collections import Counter
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, TimeDistributed, Dropout, Bidirectional
import tensorflow as tf
from sklearn.model_selection import train_test_split

### Load Dataset

In [3]:
conllu_file = 'Assignment 2 - Arabic POS.conllu'
data = pyconll.load_from_file(conllu_file)
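
For readers unfamiliar with the CoNLL-U format, the sketch below parses a tiny hand-written sample with the standard library only. It is not how the notebook loads the data (that is done by `pyconll` above); the sample text, the function name `read_conllu`, and the multiword surface form are assumptions made for illustration. Each token line has ten tab-separated columns, of which column 1 is the ID, column 2 the FORM, and column 4 the UPOS tag. Multiword ranges such as `3-4` carry `_` instead of a tag.

```python
# Hand-written CoNLL-U sample (illustrative, not copied from the dataset file).
SAMPLE = (
    "# sent_id = example-1\n"
    "1\tبرلين\t_\tX\t_\t_\t_\t_\t_\t_\n"
    "2\tترفض\t_\tVERB\t_\t_\t_\t_\t_\t_\n"
    "3-4\tلتصنيع\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "3\tل\t_\tADP\t_\t_\t_\t_\t_\t_\n"
    "4\tتصنيع\t_\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "\n"
)


def read_conllu(text):
    """Return a list of sentences, each a list of (form, upos) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment / metadata
        cols = line.split("\t")
        token_id, form, upos = cols[0], cols[1], cols[3]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword ranges (3-4) and empty nodes (5.1)
        current.append((form, upos))
    if current:
        sentences.append(current)
    return sentences


parsed = read_conllu(SAMPLE)
print(len(parsed), "sentence;", len(parsed[0]), "tagged tokens")
```

The range line `3-4` is dropped while its two syntactic words are kept, which is presumably the role of the `token.form and token.upos` check in the extraction loop of this notebook.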

### Data splitting

In [7]:
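sentences = []  # one list of token forms per sentence
pos_tags = []   # parallel list of UPOS tags per sentence

for sentence in data:
    tokens = []
    tags = []
    for token in sentence:
        # Keep only entries that carry both a surface form and a UPOS tag;
        # entries without a UPOS value are skipped.
        if token.form and token.upos:
            tokens.append(token.form)
            tags.append(token.upos)
    sentences.append(tokens)
    pos_tags.append(tags)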

In [8]:
sentences

[['برلين',
  'ترفض',
  'حصول',
  'شركة',
  'اميركية',
  'على',
  'رخصة',
  'تصنيع',
  'دبابة',
  '"',
  'ليوبارد',
  '"',
  'الالمانية'],
 ['برلين',
  '15',
  '-',
  '7',
  '(',
  'اف',
  'ب',
  ')',
  '-',
  'افادت',
  'صحيفة',
  'الاحد',
  'الالمانية',
  '"',
  'ويلت',
  'ام',
  'سونتاغ',
  '"',
  'في',
  'عدد',
  'ها',
  'الصادر',
  'غدا',
  '،',
  'ان',
  'المستشار',
  'غيرهارد',
  'شرودر',
  'يرفض',
  'حصول',
  'المجموعة',
  'الاميركية',
  '"',
  'جنرال',
  'ديناميكس',
  '"',
  'على',
  'رخصة',
  'ل',
  'تصنيع',
  'الدبابة',
  'الالمانية',
  '"',
  'ليوبارد',
  '2',
  '"',
  'عبر',
  'شراء',
  'المجموعة',
  'الحكومية',
  'الاسبانية',
  'ل',
  'الأسلحة',
  '"',
  'سانتا',
  'بربارة',
  '"',
  '.'],
 ['و',
  'في',
  'نيسان',
  '/',
  'ابريل',
  'الماضي',
  '،',
  'تخلت',
  'الدولة',
  'الاسبانية',
  'عن',
  'مجموعة',
  '"',
  'سانتا',
  'بربارة',
  '"',
  'التي',
  'تصنع',
  'دبابات',
  'ليوبارد',
  'الالمانية',
  '،',
  'الى',
  '"',
  'جنرال',
  'ديناميكس',
  '"',
  'التي',
  'تنت

In [9]:
pos_tags

[['X',
  'VERB',
  'NOUN',
  'NOUN',
  'ADJ',
  'ADP',
  'NOUN',
  'NOUN',
  'NOUN',
  'PUNCT',
  'X',
  'PUNCT',
  'ADJ'],
 ['X',
  'NUM',
  'PUNCT',
  'NUM',
  'PUNCT',
  'X',
  'X',
  'PUNCT',
  'PUNCT',
  'VERB',
  'NOUN',
  'NOUN',
  'ADJ',
  'PUNCT',
  'X',
  'X',
  'X',
  'PUNCT',
  'ADP',
  'NOUN',
  'PRON',
  'ADJ',
  'NOUN',
  'PUNCT',
  'SCONJ',
  'NOUN',
  'X',
  'X',
  'VERB',
  'NOUN',
  'NOUN',
  'ADJ',
  'PUNCT',
  'X',
  'X',
  'PUNCT',
  'ADP',
  'NOUN',
  'ADP',
  'NOUN',
  'NOUN',
  'ADJ',
  'PUNCT',
  'X',
  'NUM',
  'PUNCT',
  'ADP',
  'NOUN',
  'NOUN',
  'ADJ',
  'ADJ',
  'ADP',
  'NOUN',
  'PUNCT',
  'X',
  'X',
  'PUNCT',
  'PUNCT'],
 ['CCONJ',
  'ADP',
  'NOUN',
  'PUNCT',
  'NOUN',
  'ADJ',
  'PUNCT',
  'VERB',
  'NOUN',
  'ADJ',
  'ADP',
  'NOUN',
  'PUNCT',
  'X',
  'X',
  'PUNCT',
  'DET',
  'VERB',
  'NOUN',
  'X',
  'ADJ',
  'PUNCT',
  'ADP',
  'PUNCT',
  'X',
  'X',
  'PUNCT',
  'X',
  'VERB',
  'NOUN',
  'ADJ',
  'PUNCT',
  'X',
  'NUM',
  'X',
  'PUNC

In [12]:
sentences_train, sentences_test, pos_tags_train, pos_tags_test = train_test_split(sentences, pos_tags, test_size=0.1, random_state=42)

### Cleaning and Preprocessing

In [13]:
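# Count token frequencies on the training split only, so the test set does not leak into the vocabulary.
token_counter = Counter()
for tokens in sentences_train:
    token_counter.update(tokens)
# Ids 0 and 1 are reserved for <PAD> and <UNK>, so real tokens start at 2.
token_to_index = {token: idx + 2 for idx, token in enumerate(token_counter)}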
token_to_index['<PAD>'] = 0
token_to_index['<UNK>'] = 1
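
With the mapping above, every training token receives its own id, so the `<UNK>` id never occurs in the training inputs and its embedding is not updated during training. One possible variation, sketched here under that observation, is to map rare tokens to `<UNK>` as well. The helper names `build_vocab` and `encode` and the `min_count` parameter are hypothetical and not part of the notebook; the toy sentences are made up.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"


def build_vocab(tokenized_sentences, min_count=1):
    """Map tokens seen at least `min_count` times to ids; 0 and 1 are reserved."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {PAD: 0, UNK: 1}
    for tok, n in counts.items():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace each token by its id, falling back to the <UNK> id."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


# Toy training data (made up for this sketch).
train = [["تم", "تشكيل", "فريق"], ["تم", "تنفيذ"]]
vocab = build_vocab(train, min_count=2)
encoded = encode(["تم", "تشكيل", "مصر"], vocab)
print(encoded)  # [2, 1, 1]
```

Only `تم` occurs twice, so it is the only real token kept; the other two inputs fall back to id 1. Whether such a threshold helps on this dataset is an empirical question.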

In [14]:
token_counter

Counter({'تم': 170,
         'تشكيل': 47,
         'فريق': 25,
         'عمل': 73,
         'مصري': 38,
         'ل': 5047,
         'دراسة': 30,
         'المشروع': 78,
         'ب': 4438,
         'رئاسة': 37,
         'د': 50,
         '.': 5480,
         'إسماعيل': 9,
         'عبد': 92,
         'الجليل': 1,
         'رئيس': 356,
         'مركز': 61,
         'بحوث': 5,
         'الصحراء': 10,
         'و': 11757,
         'عضوية': 8,
         'المهندس': 31,
         'محمد': 124,
         'الشحات': 2,
         'هيئة': 55,
         'تنمية': 31,
         'بحيرة': 7,
         'السد': 6,
         'العالي': 8,
         'عدد': 220,
         'كبير': 65,
         'من': 3885,
         'خبراء': 30,
         'وزارة': 114,
         'الري': 6,
         'الموارد': 27,
         'المائية': 11,
         'مراكز': 25,
         'البحوث': 14,
         'الجامعات': 12,
         'يمثل': 37,
         'الجانب': 52,
         'الياباني': 11,
         'في': 5304,
         'خالد': 19,
         'زيد': 12,
     

In [15]:
token_to_index

{'تم': 2,
 'تشكيل': 3,
 'فريق': 4,
 'عمل': 5,
 'مصري': 6,
 'ل': 7,
 'دراسة': 8,
 'المشروع': 9,
 'ب': 10,
 'رئاسة': 11,
 'د': 12,
 '.': 13,
 'إسماعيل': 14,
 'عبد': 15,
 'الجليل': 16,
 'رئيس': 17,
 'مركز': 18,
 'بحوث': 19,
 'الصحراء': 20,
 'و': 21,
 'عضوية': 22,
 'المهندس': 23,
 'محمد': 24,
 'الشحات': 25,
 'هيئة': 26,
 'تنمية': 27,
 'بحيرة': 28,
 'السد': 29,
 'العالي': 30,
 'عدد': 31,
 'كبير': 32,
 'من': 33,
 'خبراء': 34,
 'وزارة': 35,
 'الري': 36,
 'الموارد': 37,
 'المائية': 38,
 'مراكز': 39,
 'البحوث': 40,
 'الجامعات': 41,
 'يمثل': 42,
 'الجانب': 43,
 'الياباني': 44,
 'في': 45,
 'خالد': 46,
 'زيد': 47,
 'هو': 48,
 'الجنسية': 49,
 'يعد': 50,
 'أحد': 51,
 'الخبراء': 52,
 'الدوليين': 53,
 'العاملين': 54,
 'مجال': 55,
 'تكنولوجيا': 56,
 'الاتصالات': 57,
 'اليابان': 58,
 'منذ': 59,
 'ثلاثين': 60,
 'عاما': 61,
 'كلفت': 62,
 'ه': 63,
 'إحدى': 64,
 'الشركات': 65,
 'اليابانية': 66,
 'العملاقة': 67,
 'التفاوض': 68,
 'مع': 69,
 'مصر': 70,
 'تنفيذ': 71,
 'الذي': 72,
 'س': 73,
 'يتم': 74,
 'تمويل':

In [17]:
# Build the tag vocabulary from the training split; id 0 is reserved for <PAD>.
tag_counter = Counter(tag for tags in pos_tags_train for tag in tags)
tag_to_index = {tag: idx + 1 for idx, tag in enumerate(tag_counter)}
tag_to_index['<PAD>'] = 0

In [21]:
tag_to_index

{'VERB': 1,
 'NOUN': 2,
 'ADJ': 3,
 'ADP': 4,
 'X': 5,
 'PUNCT': 6,
 'CCONJ': 7,
 'PRON': 8,
 'NUM': 9,
 'DET': 10,
 'AUX': 11,
 'PART': 12,
 'ADV': 13,
 'SCONJ': 14,
 'SYM': 15,
 'PROPN': 16,
 'INTJ': 17,
 '<PAD>': 0}
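
To read predictions back as tag names, the mapping can be inverted. The following is a minimal sketch that reuses the `tag_to_index` values printed above; `index_to_tag` and `decode_tags` are hypothetical names, and the sentence length is passed explicitly so that padded positions are cut off regardless of what the model predicts there.

```python
# Copied from the tag_to_index output shown in the notebook.
tag_to_index = {
    'VERB': 1, 'NOUN': 2, 'ADJ': 3, 'ADP': 4, 'X': 5, 'PUNCT': 6,
    'CCONJ': 7, 'PRON': 8, 'NUM': 9, 'DET': 10, 'AUX': 11, 'PART': 12,
    'ADV': 13, 'SCONJ': 14, 'SYM': 15, 'PROPN': 16, 'INTJ': 17, '<PAD>': 0,
}

# Inverse lookup: id -> tag string.
index_to_tag = {idx: tag for tag, idx in tag_to_index.items()}


def decode_tags(tag_ids, sentence_length):
    """Convert the first `sentence_length` ids of a padded row to tag names."""
    return [index_to_tag[int(i)] for i in tag_ids[:sentence_length]]


decoded = decode_tags([7, 1, 2, 0, 0], sentence_length=3)
print(decoded)  # ['CCONJ', 'VERB', 'NOUN']
```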

In [33]:
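# Pad every sequence to the longest sentence found in either split.
# Id 0 is the <PAD> id for both tokens and tags.
max_len = max(max(len(s) for s in sentences_train), max(len(s) for s in sentences_test))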
X_train = [[token_to_index.get(token, token_to_index['<UNK>']) for token in tokens] for tokens in sentences_train]
X_train = np.array([np.pad(x, (0, max_len - len(x)), mode='constant') for x in X_train])
y_train = [[tag_to_index[tag] for tag in tags] for tags in pos_tags_train]
y_train = np.array([np.pad(t, (0, max_len - len(t)), mode='constant') for t in y_train])
#y_train = y_train[..., np.newaxis]

X_test = [[token_to_index.get(token, token_to_index['<UNK>']) for token in tokens] for tokens in sentences_test]
X_test = np.array([np.pad(x, (0, max_len - len(x)), mode='constant') for x in X_test])
y_test = [[tag_to_index[tag] for tag in tags] for tags in pos_tags_test]
y_test = np.array([np.pad(t, (0, max_len - len(t)), mode='constant') for t in y_test])
#y_test = y_test[..., np.newaxis]

In [34]:
y_test

array([[ 7,  1,  2, ...,  0,  0,  0],
       [ 4,  2, 12, ...,  0,  0,  0],
       [ 7,  1,  2, ...,  0,  0,  0],
       ...,
       [ 2,  2,  3, ...,  0,  0,  0],
       [ 7,  1,  2, ...,  0,  0,  0],
       [ 1,  2,  3, ...,  0,  0,  0]])
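
The padding step can also be written as a small helper that fills one preallocated array instead of calling `np.pad` per sentence. This is only a sketch of an equivalent approach; `pad_sequences_post` is a hypothetical name, and the toy ids are made up. Keras also provides a `pad_sequences` utility, which pads at the front by default and would need `padding='post'` to match the layout used here.

```python
import numpy as np


def pad_sequences_post(sequences, max_len, pad_id=0):
    """Right-pad (and truncate, if needed) id sequences to shape (n, max_len)."""
    out = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    for row, seq in enumerate(sequences):
        seq = seq[:max_len]          # truncate sequences longer than max_len
        out[row, :len(seq)] = seq    # real ids first, padding stays at the end
    return out


X = pad_sequences_post([[7, 1, 2], [4, 2]], max_len=4)
mask = X != 0                        # True at real tokens, False at padding
print(X.tolist())                    # [[7, 1, 2, 0], [4, 2, 0, 0]]
print(mask.sum(axis=1).tolist())     # [3, 2]
```

The boolean mask recovers the true sentence lengths, which works only because id 0 is never assigned to a real token or tag.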

### Modelling

In [35]:
embedding_dim = 128
hidden_units = 64
num_tags = len(tag_to_index)

In [36]:
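model = Sequential([
    # mask_zero=True marks id 0 (<PAD>) as padding so downstream layers can skip it.
    Embedding(input_dim=len(token_to_index), output_dim=embedding_dim, input_length=max_len, mask_zero=True),
    # return_sequences=True yields one output per token, as sequence tagging requires.
    Bidirectional(LSTM(units=hidden_units, return_sequences=True, dropout=0.5, recurrent_dropout=0.5)),
    # Per-token softmax over all tag ids (including the <PAD> id).
    TimeDistributed(Dense(units=num_tags, activation='softmax'))
])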



In [41]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 398, 128)          2675328   
                                                                 
 bidirectional (Bidirectiona  (None, 398, 128)         98816     
 l)                                                              
                                                                 
 time_distributed (TimeDistr  (None, 398, 18)          2322      
 ibuted)                                                         
                                                                 
Total params: 2,776,466
Trainable params: 2,776,466
Non-trainable params: 0
_________________________________________________________________
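
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below uses the hyperparameters from this notebook; the vocabulary size is not printed anywhere, so it is derived here from the embedding row of the summary (an assumption of this sketch).

```python
embedding_dim = 128
hidden_units = 64
num_tags = 18                             # 17 UPOS tags + <PAD>
vocab_size = 2675328 // embedding_dim     # derived from the summary: 20901

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_one_direction = 4 * (embedding_dim + hidden_units + 1) * hidden_units
bilstm_params = 2 * lstm_one_direction    # forward + backward copies

# Dense: input is the concatenation of both directions (2 * hidden_units).
dense_params = (2 * hidden_units + 1) * num_tags

total_params = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total_params)
# 2675328 98816 2322 2776466
```

About 96% of the parameters sit in the embedding table, so vocabulary size dominates the model size here.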


In [42]:
history = model.fit(X_train, y_train, batch_size=256, epochs=4, validation_split=0.1, verbose=1)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


#### Evaluation

**Evaluation metric:**
Accuracy

In [43]:
loss, accuracy = model.evaluate(X_test, y_test, verbose=1)
print(f'Test Accuracy: {accuracy:.4f}')

Test Accuracy: 0.6032
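
Because most positions in each padded row are `<PAD>`, it matters whether padding is counted when accuracy is computed. Whether the Keras `accuracy` metric excludes padded positions depends on how the mask is propagated in the installed version, so a token-level accuracy can be computed explicitly as a cross-check. The sketch below is one way to do that; `masked_accuracy` is a hypothetical helper and the arrays are toy data, not results from this model.

```python
import numpy as np


def masked_accuracy(y_true, y_pred_ids, pad_id=0):
    """Fraction of correctly tagged tokens, ignoring positions where y_true is padding."""
    y_true = np.asarray(y_true)
    y_pred_ids = np.asarray(y_pred_ids)
    mask = y_true != pad_id
    if not mask.any():
        raise ValueError("no non-padding positions to evaluate")
    return float((y_true[mask] == y_pred_ids[mask]).mean())


# Toy example: 5 real tokens, one of them tagged incorrectly.
y_true = np.array([[7, 1, 2, 0, 0], [4, 2, 0, 0, 0]])
y_pred = np.array([[7, 1, 3, 0, 0], [4, 2, 0, 0, 0]])

token_acc = masked_accuracy(y_true, y_pred)
naive_acc = float((y_true == y_pred).mean())
print(token_acc, naive_acc)  # 0.8 0.9
```

In the notebook, the predicted ids would come from something like `model.predict(X_test).argmax(axis=-1)`, compared against `y_test`. The gap between the two numbers in the toy example shows how counting padding can inflate the score.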


### Enhancement

### Conclusion and final results


#### Done!