# Part-of-Speech Tagging for Sindhi Language

## Introduction

This notebook demonstrates the process of training and evaluating a Part-of-Speech (POS) tagger for the Sindhi language using Conditional Random Fields (CRF).

Sindhi is an Indo-Aryan language spoken primarily in the Sindh province of Pakistan and parts of India. It has a rich morphological structure and a unique script, which presents interesting challenges for natural language processing tasks.

## Objectives

1. Preprocess and prepare Sindhi language data for POS tagging
2. Extract relevant features from Sindhi text
3. Train a CRF model for POS tagging
4. Evaluate the model's performance
5. Demonstrate inference on new Sindhi sentences

## Dataset

We use a custom dataset of Sindhi text, manually annotated with POS tags. The data is stored in a CSV file named 'SindhiPosDataset.csv', with one token per row alongside its lemma, POS tags, morphological features, and pronunciation.

## Methodology

We employ Conditional Random Fields (CRF), a statistical modeling method often used for structured prediction tasks like POS tagging. CRFs are particularly suitable for Sindhi due to their ability to capture context and handle rich feature sets.
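For reference, the standard linear-chain CRF formulation is sketched below. The notation is generic textbook notation rather than anything defined in this notebook: $\mathbf{x}$ is the token sequence, $\mathbf{y}$ the tag sequence, $f_k$ the feature functions, and $\lambda_k$ their learned weights.

$$
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)
$$

Because each $f_k$ can look at the whole input $\mathbf{x}$ as well as the previous tag, features such as "the next word is X" or "this is the first token" fit naturally into the model.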

## Libraries Used

- pandas: For data manipulation and analysis
- pycrfsuite: For implementing Conditional Random Fields
- scikit-learn: Optional, for additional evaluation metrics (not imported in this notebook)

## Note

POS tagging for Sindhi presents unique challenges due to the language's morphological complexity and the limited availability of large, annotated datasets. This project aims to contribute to the growing field of NLP for less-resourced languages.

Let's begin by importing necessary libraries and loading our data!

In [53]:
import pandas as pd
import pycrfsuite

## Load and Examine Data

In this section, we load the Sindhi POS dataset from a CSV file and perform an initial examination to understand its structure and content. Each row describes one token through the columns `ID`, `FORM`, `LEMMA`, `UPOS`, `XPOS`, `FEATS`, and `Pronunciation`.


In [54]:
# Load data
data = pd.read_csv('SindhiPosDataset.csv')
data.head()

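| Unnamed: 0 | ID | FORM | LEMMA | UPOS | XPOS | FEATS | Pronunciation |
|---|---|---|---|---|---|---|---|
| 0 | 1 | يقين | يقين | NOUN | NN | XPOS = | yaqeen |
| 1 | 2 | ڪرڻ | ڪر | VERB | VB | XPOS = | karan |
| 2 | 3 | سان | سان | ADP | IN | XPOS = | Saan |
| 3 | 4 | اڪثر | اڪثر | ADV | RB | XPOS = | Aksar |
| 4 | 5 | ڌوڪو | ڌوڪو | NOUN | NN | XPOS = | dhoko |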


### Feature Extraction Function

This cell defines the function `extract_features`, which builds the feature dictionary for one token in a sequence. The features come from the token's own word form, the word forms of its immediate neighbors, and its position at the start or end of the sequence.

#### Function Definition: `extract_features(tokens, idx)`

- **Parameters**:
  - `tokens`: A list of dict-like token records (in this notebook, pandas rows), each exposing at least a `FORM` field.
  - `idx`: The index of the current token for which features are being extracted.

- **Returns**:
  - A dictionary of features for the token at the specified index.

#### Features Extracted:
- **Current Word** (`word`): The word form of the token at the given index.
- **Previous Word** (`-1:word`): The word form of the preceding token, if the current token is not at the start of the sequence.
- **Next Word** (`+1:word`): The word form of the following token, if the current token is not at the end of the sequence.
- **Beginning of Sentence** (`BOS`): A flag set only on the first token of the sequence.
- **End of Sentence** (`EOS`): A flag set only on the last token of the sequence.

In [55]:
# Define function to extract features from a token
def extract_features(tokens, idx):
    token = tokens[idx]
    features = {
        'word': token['FORM'],
    }
    
    if idx > 0:
        prev_token = tokens[idx - 1]
        features.update({
            '-1:word': prev_token['FORM'],
        })
    else:
        features['BOS'] = True  # Beginning of sentence

    if idx < len(tokens) - 1:
        next_token = tokens[idx + 1]
        features.update({
            '+1:word': next_token['FORM'],
        })
    else:
        features['EOS'] = True  # End of sentence

    return features
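To make the output concrete, here is a small sketch that runs the same logic on a made-up three-token sequence. The function is repeated in compact form so the example runs on its own, and `toy_tokens` is a hypothetical input built from the romanized forms in the sample rows, not data taken from the corpus as a sentence.

```python
def extract_features(tokens, idx):
    """Compact copy of the notebook's feature function, repeated for a standalone example."""
    features = {'word': tokens[idx]['FORM']}
    if idx > 0:
        features['-1:word'] = tokens[idx - 1]['FORM']
    else:
        features['BOS'] = True  # first token of the sequence
    if idx < len(tokens) - 1:
        features['+1:word'] = tokens[idx + 1]['FORM']
    else:
        features['EOS'] = True  # last token of the sequence
    return features


# Hypothetical toy sequence; plain dicts stand in for the pandas rows
toy_tokens = [{'FORM': 'yaqeen'}, {'FORM': 'karan'}, {'FORM': 'saan'}]

first = extract_features(toy_tokens, 0)
middle = extract_features(toy_tokens, 1)
last = extract_features(toy_tokens, 2)

print(first)   # {'word': 'yaqeen', 'BOS': True, '+1:word': 'karan'}
print(middle)  # {'word': 'karan', '-1:word': 'yaqeen', '+1:word': 'saan'}
print(last)    # {'word': 'saan', '-1:word': 'karan', 'EOS': True}
```

Only the boundary tokens carry `BOS` or `EOS`; interior tokens carry both neighbor words instead.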

### Group Data into Sentences

This cell groups the tokens of the dataset into sentences. Each row represents one token, and its `ID` column gives the token's position within its sentence, so an `ID` of 1 marks the start of a new sentence. The code iterates through the rows and collects tokens into sentences accordingly.

#### Code Explanation

- **Initialization**:
  - `sentences`: An empty list to store the sentences after grouping.
  - `current_sentence`: An empty list to collect tokens for the current sentence being processed.

- **Processing**:
  - The code iterates through each row of the dataset.
  - For each row, it checks whether the token ID (`row['ID']`) marks the start of a new sentence (`ID == 1`) and whether `current_sentence` already holds tokens. If both are true, it appends `current_sentence` to `sentences` and resets `current_sentence` for the new sentence.
  - The current token is then added to `current_sentence`.

- **Final Check**:
  - After the loop, if there are remaining tokens in `current_sentence`, they are added to `sentences`.

#### Result

- `sentences`: A list of lists, where each inner list contains the tokens (rows) of one sentence.


In [56]:
# Group data into sentences
sentences = []
current_sentence = []

for _, row in data.iterrows():
    if row['ID'] == 1 and current_sentence:
        sentences.append(current_sentence)
        current_sentence = []
    current_sentence.append(row)

if current_sentence:
    sentences.append(current_sentence)
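The grouping rule can be checked on a handful of made-up rows. The sketch below wraps the same loop in a hypothetical helper, `group_into_sentences`, and uses plain dictionaries in place of DataFrame rows, so it runs without pandas.

```python
def group_into_sentences(rows):
    """Split a flat list of token rows into sentences; an ID of 1 starts a new sentence."""
    sentences, current = [], []
    for row in rows:
        if row['ID'] == 1 and current:
            sentences.append(current)
            current = []
        current.append(row)
    if current:  # flush the last sentence
        sentences.append(current)
    return sentences


# Hypothetical rows: two sentences of three and two tokens
toy_rows = [
    {'ID': 1, 'FORM': 'w1'}, {'ID': 2, 'FORM': 'w2'}, {'ID': 3, 'FORM': 'w3'},
    {'ID': 1, 'FORM': 'w4'}, {'ID': 2, 'FORM': 'w5'},
]

grouped = group_into_sentences(toy_rows)
sentence_lengths = [len(sentence) for sentence in grouped]
print(sentence_lengths)  # [3, 2]
```

This approach relies on every sentence starting with `ID == 1`; if the numbering in the file is irregular, sentence boundaries would be missed.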

### Extract Features and Labels

In this cell, we prepare the dataset for model training by extracting features and labels from the preprocessed sentences. This step is crucial for training a sequence labeling model, such as a Conditional Random Field (CRF) model, for part-of-speech (POS) tagging.

#### Code Explanation

- **Initialization**:
  - `X`: An empty list to hold the feature sets for each token in each sentence.
  - `y`: An empty list to hold the POS tags for each token in each sentence.

- **Processing**:
  - **Iteration Over Sentences**:
    - For each sentence in the `sentences` list:
      - **Feature Extraction**:
        - `extract_features(sentence, idx)`: The `extract_features` function is called for each token in the sentence. It generates a feature dictionary based on the current token and its neighboring tokens (previous and next).
        - This produces a list of feature dictionaries, where each dictionary corresponds to a token's features.
      - **Label Extraction**:
        - `[token['UPOS'] for token in sentence]`: Extracts the Universal Part-Of-Speech (UPOS) tag for each token in the sentence.

- **Result**:
  - `X`: A list of feature lists, where each sublist corresponds to a sentence and contains dictionaries of token features.
  - `y`: A list of label lists, where each sublist corresponds to a sentence and contains the UPOS tags for the tokens in that sentence.

In [57]:
# Extract features and labels
X = []
y = []

for sentence in sentences:
    X.append([extract_features(sentence, idx) for idx in range(len(sentence))])
    y.append([token['UPOS'] for token in sentence])

### Convert Features for PyCRFSuite

This cell defines a function `convert_features` that converts all feature values to strings before they are passed to the PyCRFSuite library for training and prediction.

1. **Function Definition:**
   * `def convert_features(X):` defines a function named `convert_features` that takes a list of feature dictionaries `X` as input. In this notebook it is called with the feature dictionaries of one sentence at a time.

2. **Feature Conversion:**
   * `[{k: str(v) for k, v in x.items()} for x in X]` is a list comprehension that iterates over each feature dictionary `x` in the input list `X`.
   * For each feature dictionary `x`, it creates a new dictionary with the same keys and with each value converted to a string using `str(v)`. Boolean flags such as `BOS` therefore become the string `'True'`. python-crfsuite also accepts boolean and numeric values directly, so the conversion mainly ensures that every feature is treated as a categorical string value.
   * The resulting list of converted feature dictionaries is returned.

**Output:**

The function returns a list of feature dictionaries, one per token, in which every feature value is a string.

In [58]:
# Convert features to the format required by pycrfsuite
def convert_features(X):
    return [{k: str(v) for k, v in x.items()} for x in X]
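A quick illustration of the conversion on two made-up feature dictionaries follows. `toy_xseq` is a hypothetical input; the function body is the one from the cell above, repeated so the example runs on its own.

```python
def convert_features(X):
    """Same conversion as in the notebook: keep keys, stringify values."""
    return [{k: str(v) for k, v in x.items()} for x in X]


# Hypothetical features for a two-token sentence
toy_xseq = [
    {'word': 'yaqeen', 'BOS': True, '+1:word': 'karan'},
    {'word': 'karan', '-1:word': 'yaqeen', 'EOS': True},
]

converted = convert_features(toy_xseq)
print(converted[0])  # {'word': 'yaqeen', 'BOS': 'True', '+1:word': 'karan'}
```

The word features are already strings and pass through unchanged; only the boolean flags change type.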

### Prepare the CRF Trainer

This cell creates a CRF trainer with the PyCRFSuite library and loads the training sequences into it. Training itself happens in the next cell.

1. **Trainer Initialization:**
   * `trainer = pycrfsuite.Trainer(verbose=False)` creates a CRF trainer instance from the `pycrfsuite` module imported at the top of the notebook. The `verbose=False` argument suppresses training progress messages.

2. **Data Appending:**
   * The `for` loop iterates over pairs of feature sequences (`xseq`) and label sequences (`yseq`), one pair per sentence.
   * `trainer.append(convert_features(xseq), yseq)` adds each pair to the trainer's dataset. The `convert_features` function defined in the previous cell converts the feature values to strings first.

**Key Points:**

* The `Trainer` object accumulates training data in memory.
* This cell only prepares and appends data. The model is trained and saved afterwards by calling `trainer.train(model_file)`.

**Additional Considerations:**

* **Feature Engineering:** The quality of the extracted features significantly affects the model's performance.
* **Hyperparameter Tuning:** PyCRFSuite offers various parameters to control the training process, which can be tuned for better results.
* **Evaluation:** Once the model is trained, it can be used to make predictions on new data and evaluated using appropriate metrics.


In [59]:
trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(X, y):
    trainer.append(convert_features(xseq), yseq)


### Set Hyperparameters and Train

This cell sets the hyperparameters of the trainer created in the previous cell and then trains the CRF model.

1. **Setting Parameters:**
   * `trainer.set_params()` configures the CRF model's hyperparameters:
     - `c1`: Coefficient for L1 regularization
     - `c2`: Coefficient for L2 regularization
     - `max_iterations`: Maximum number of iterations for training
     - `feature.possible_transitions`: Whether to generate transition features for label pairs that do not occur in the training data

2. **Training:**
   * `trainer.train('sindhiposmodel.crfsuite')` trains the model with the specified parameters and saves it to the file 'sindhiposmodel.crfsuite'.

**Explanation of Parameters:**

* **c1 (L1 penalty):** Controls the L1 regularization strength. Higher values push more feature weights to exactly zero, which acts as a form of feature selection.
* **c2 (L2 penalty):** Controls the L2 regularization strength. It helps prevent overfitting by discouraging large weights.
* **max_iterations:** Caps the number of iterations of the training algorithm; training may stop earlier if it converges.
* **feature.possible_transitions:** When enabled, CRFsuite also creates transition features for tag pairs never seen in the training data, so the model can learn weights (typically negative) for them.

**Key Points:**

* Hyperparameter choices can noticeably affect CRF performance, so experimenting with several values is usually worthwhile.
* The available hyperparameters depend on the chosen training algorithm.

**Additional Considerations:**

* Consider using cross-validation or grid search to find the best hyperparameter combination.
* Explore the other training algorithms offered by PyCRFSuite.
* Evaluate each trained model using appropriate metrics on data that was not used for training.

In [60]:
trainer.set_params({
    'c1': 1.0,  # Coefficient for L1 penalty
    'c2': 1e-3,  # Coefficient for L2 penalty
    'max_iterations': 50,  # Maximum number of iterations
    'feature.possible_transitions': True
})

# Train the model
trainer.train('sindhiposmodel.crfsuite')
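As a starting point for the grid search mentioned above, the candidate parameter sets can be enumerated with the standard library. This is only a sketch: `parameter_grid` is a hypothetical helper, and the candidate values are illustrative assumptions, not values tested in this notebook.

```python
from itertools import product


def parameter_grid(c1_values, c2_values, max_iterations=50):
    """Yield one CRFsuite parameter dict per (c1, c2) combination."""
    for c1, c2 in product(c1_values, c2_values):
        yield {
            'c1': c1,
            'c2': c2,
            'max_iterations': max_iterations,
            'feature.possible_transitions': True,
        }


# Illustrative candidates; the notebook itself uses c1=1.0 and c2=1e-3
grid = list(parameter_grid([0.01, 0.1, 1.0], [1e-3, 1e-2]))
print(len(grid))  # 6
print(grid[0])
```

Each dictionary could be passed to `set_params` on a freshly created trainer, with the resulting models compared on held-out sentences rather than on the training data.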

### Load the Trained Model

This cell loads the trained CRF model with the `pycrfsuite.Tagger` class.

1. **Tagger Creation:**
   - `tagger = pycrfsuite.Tagger()` creates a tagger object for prediction.

2. **Model Loading:**
   - `tagger.open('sindhiposmodel.crfsuite')` loads the model saved in the previous cell from the file 'sindhiposmodel.crfsuite' into the tagger object.

**Key Points:**

* The loaded model can be used to make predictions on new data.
* The model file uses the CRFsuite format and might not be compatible with other libraries.


In [61]:
tagger = pycrfsuite.Tagger()
tagger.open('sindhiposmodel.crfsuite')

<contextlib.closing at 0x226986a2cf0>
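With the tagger loaded, tagging a new sentence might look like the sketch below. `word_features` and `tag_sentence` are hypothetical helpers, not part of the notebook: the first mirrors the feature keys of `extract_features` but works on plain word strings, and the second assumes the sentence has already been split into tokens.

```python
def word_features(words, idx):
    """Build the notebook's feature keys from plain word strings (values already strings)."""
    features = {'word': words[idx]}
    if idx > 0:
        features['-1:word'] = words[idx - 1]
    else:
        features['BOS'] = 'True'
    if idx < len(words) - 1:
        features['+1:word'] = words[idx + 1]
    else:
        features['EOS'] = 'True'
    return features


def tag_sentence(words, tagger):
    """Return (word, tag) pairs; `tagger` is any object with a `tag(xseq)` method."""
    xseq = [word_features(words, i) for i in range(len(words))]
    return list(zip(words, tagger.tag(xseq)))


# With the tagger loaded above, usage would look like:
# tag_sentence(['يقين', 'ڪرڻ', 'سان'], tagger)
```

The features must match those used in training exactly, including the string form of the `BOS` and `EOS` flags; otherwise the model sees attributes it has no weights for.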

### Define the Evaluation Function

This cell defines a function `evaluate_model` that measures the token-level accuracy of a CRF model on a given dataset.

1. **Function Definition:**
   * `def evaluate_model(X, y, tagger):` defines a function named `evaluate_model` that takes three arguments:
     - `X`: A list of feature sequences.
     - `y`: A list of corresponding label sequences.
     - `tagger`: A trained CRF model instance.

2. **Prediction:**
   * `y_pred = []` initializes an empty list to store predicted label sequences.
   * The `for` loop iterates over each feature sequence `xseq` in `X`.
   * For each `xseq`, the `tagger.tag(convert_features(xseq))` method is used to obtain the predicted label sequence, which is appended to `y_pred`.

3. **Counting Correct Predictions:**
   * `correct` and `total` are initialized to 0 to keep track of correct and total predictions.
   * The outer `for` loop iterates over pairs of true label sequences (`yseq`) and predicted label sequences (`yseq_pred`).
   * The inner `for` loop compares each predicted label (`y_hat`) with the corresponding true label (`y_true`).
   * If the predicted label matches the true label, `correct` is incremented.
   * `total` is incremented for each label comparison.

4. **Accuracy Calculation:**
   * The final accuracy is calculated by dividing the number of correct predictions (`correct`) by the total number of predictions (`total`).

5. **Return Value:**
   * The function returns the calculated accuracy as a floating-point value.

**Key Points:**

* The function relies on the `convert_features` function defined earlier for feature conversion.
* The metric is plain token-level accuracy, which can hide poor performance on rare tags when the tag distribution is imbalanced.
* The function raises a `ZeroDivisionError` if it is given no tokens.

**Improvements:**

* Report per-tag precision, recall, and F1-score, or a confusion matrix, for a more complete picture of the model's performance.
* Consider using the metric functions in scikit-learn instead of hand-written counting.
* Handle potential errors or exceptions that might occur during prediction.

In [62]:
# Function to evaluate model
def evaluate_model(X, y, tagger):
    y_pred = []
    for xseq in X:
        y_pred.append(tagger.tag(convert_features(xseq)))
    
    correct = 0
    total = 0
    for yseq, yseq_pred in zip(y, y_pred):
        for y_true, y_hat in zip(yseq, yseq_pred):
            if y_true == y_hat:
                correct += 1
            total += 1
    
    accuracy = correct / total
    return accuracy
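As a complement to overall accuracy, per-tag precision, recall, and F1 can be computed with the standard library alone. The sketch below is one possible way to do it; `per_tag_scores` is a hypothetical helper, and the gold and predicted tags are made up for illustration.

```python
from collections import Counter


def per_tag_scores(y_true_seqs, y_pred_seqs):
    """Compute precision, recall, and F1 for each tag from parallel tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for yseq, yseq_pred in zip(y_true_seqs, y_pred_seqs):
        for gold, pred in zip(yseq, yseq_pred):
            if gold == pred:
                tp[gold] += 1
            else:
                fp[pred] += 1  # predicted tag was wrong
                fn[gold] += 1  # gold tag was missed

    scores = {}
    for tag in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[tag] + fp[tag]
        actual = tp[tag] + fn[tag]
        precision = tp[tag] / predicted if predicted else 0.0
        recall = tp[tag] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[tag] = {'precision': precision, 'recall': recall, 'f1': f1}
    return scores


# Made-up gold and predicted tags: the ADV token is mistagged as NOUN
gold = [['NOUN', 'VERB', 'ADP'], ['ADV', 'NOUN']]
pred = [['NOUN', 'VERB', 'ADP'], ['NOUN', 'NOUN']]

scores = per_tag_scores(gold, pred)
for tag, s in scores.items():
    print(f"{tag}: P={s['precision']:.2f} R={s['recall']:.2f} F1={s['f1']:.2f}")
```

In this toy case overall accuracy is 4 out of 5, yet recall for `ADV` is zero, which is exactly the kind of detail a single accuracy figure hides.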

### Evaluate the Model

This cell evaluates the trained CRF model and prints its accuracy.

1. **Model Evaluation:**
   * `accuracy = evaluate_model(X, y, tagger)` calls the `evaluate_model` function defined in the previous cell. Note that `X` and `y` are the same sequences the model was trained on, so the result is training accuracy, not accuracy on unseen data.

2. **Printing Accuracy:**
   * `print(f'Accuracy: {accuracy:.4f}')` prints the calculated accuracy with four decimal places.

**Key Points:**

* Because no sentences were held out, the reported accuracy is likely an optimistic estimate of how the tagger performs on new text.
* A more reliable estimate requires splitting the sentences into training and test sets before training, or using cross-validation.

**Additional Considerations:**

* Consider using other evaluation metrics beyond accuracy, such as precision, recall, and F1-score.
* These metrics can be calculated manually or with an evaluation library such as scikit-learn.

In [63]:
# Evaluate the model
accuracy = evaluate_model(X, y, tagger)
print(f'Accuracy: {accuracy:.4f}')

Accuracy: 0.9645
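To estimate performance on unseen text, the sentences can be split before training. The following is a minimal sketch of a sentence-level split using only the standard library; `split_sentences`, the 80/20 ratio, and the seed are assumptions for illustration, and the toy data stands in for the real `X` and `y`.

```python
import random


def split_sentences(X, y, test_fraction=0.2, seed=42):
    """Shuffle sentence indices reproducibly and hold out a fraction for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx = set(indices[:n_test])

    X_train = [X[i] for i in indices if i not in test_idx]
    y_train = [y[i] for i in indices if i not in test_idx]
    X_test = [X[i] for i in indices if i in test_idx]
    y_test = [y[i] for i in indices if i in test_idx]
    return X_train, y_train, X_test, y_test


# Toy stand-ins for the notebook's X and y: ten one-token sentences
toy_X = [[{'word': f'w{i}'}] for i in range(10)]
toy_y = [['NOUN'] for _ in range(10)]

X_train, y_train, X_test, y_test = split_sentences(toy_X, toy_y)
print(len(X_train), len(X_test))  # 8 2
```

The trainer would then be fed only `X_train` and `y_train`, and `evaluate_model(X_test, y_test, tagger)` would give the held-out accuracy. Splitting by sentence rather than by token keeps each sequence intact for the CRF.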
