# **Identifying Key Entities in Recipe Data**


**Business Objective**:
The goal of this assignment is to train a Named Entity Recognition (NER) model using Conditional Random Fields (CRF) to extract key entities from recipe data. The model will classify words into predefined categories such as ingredients, quantities and units, enabling the creation of a structured database of recipes and ingredients that can be used to power advanced features in recipe management systems, dietary tracking apps, or e-commerce platforms.

### **Data Description**
The given data is in JSON format, representing a **structured recipe ingredient list** with **Named Entity Recognition (NER) labels**. Below is a breakdown of the data fields:

```json
[
    {
        "input": "6 Karela Bitter Gourd Pavakkai Salt 1 Onion 3 tablespoon Gram flour besan 2 teaspoons Turmeric powder Haldi Red Chilli Cumin seeds Jeera Coriander Powder Dhania Amchur Dry Mango Sunflower Oil",
        "pos": "quantity ingredient ingredient ingredient ingredient ingredient quantity ingredient quantity unit ingredient ingredient ingredient quantity unit ingredient ingredient ingredient ingredient ingredient ingredient ingredient ingredient ingredient ingredient ingredient ingredient ingredient ingredient ingredient ingredient"
    },
    {
      "input": "2-1/2 cups rice cooked 3 tomatoes teaspoons BC Belle Bhat powder 1 teaspoon chickpea lentils 1/2 cumin seeds white urad dal mustard green chilli dry red 2 cashew or peanuts 1-1/2 tablespoon oil asafoetida",
      "pos": "quantity unit ingredient ingredient quantity ingredient unit ingredient ingredient ingredient ingredient quantity unit ingredient ingredient quantity ingredient ingredient ingredient ingredient ingredient ingredient ingredient ingredient ingredient ingredient quantity ingredient ingredient ingredient quantity unit ingredient ingredient"
    }
]


| **Key**  | **Description**  |
|----------|-----------------|
| `input`  | Contains a raw ingredient list from a recipe. |
| `pos`    | Represents the corresponding part-of-speech (POS) tags or NER labels, identifying quantities, ingredients, and units. |


## **1** Import libraries

#### **1.1** Installation of sklearn-crfsuite

sklearn-crfsuite is a Python wrapper for CRFsuite, a fast and efficient implementation of Conditional Random Fields (CRFs). It is designed to integrate seamlessly with scikit-learn for structured prediction tasks such as Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and chunking.

In [None]:
# installation of sklearn_crfsuite
!pip install sklearn_crfsuite==0.5.0

Collecting sklearn_crfsuite==0.5.0
  Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting python-crfsuite>=0.9.7 (from sklearn_crfsuite==0.5.0)
  Downloading python_crfsuite-0.9.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.3 kB)
Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl (10 kB)
Downloading python_crfsuite-0.9.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m51.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: python-crfsuite, sklearn_crfsuite
Successfully installed python-crfsuite-0.9.11 sklearn_crfsuite-0.5.0


#### **1.2** Import necessary libraries

In [None]:
# Import warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import necessary libraries
import json  # For handling JSON data
import pandas as pd  # For data manipulation and analysis
import re  # For regular expressions (useful for text preprocessing)
import matplotlib.pyplot as plt  # For visualisation
import seaborn as sns  # For advanced data visualisation
import sklearn_crfsuite  # CRF (Conditional Random Fields) implementation for sequence modeling
import numpy as np  # For numerical computations
# Saving and loading machine learning models
import joblib
import random
import spacy
from IPython.display import display, Markdown # For displaying well-formatted output

from fractions import Fraction  # For handling fractional values in numerical data
# Importing tools for feature engineering and model training
from collections import Counter  # For counting occurrences of elements in a list
from sklearn.model_selection import train_test_split  # For splitting dataset into train and test sets
from sklearn_crfsuite import metrics  # For evaluating CRF models
from sklearn_crfsuite.metrics import flat_classification_report
from sklearn.utils.class_weight import compute_class_weight
from collections import Counter
from sklearn.metrics import confusion_matrix

In [None]:
# Ensure pandas displays full content
pd.set_option('display.max_colwidth', None)
pd.set_option('display.expand_frame_repr', False)

## **2** Data Ingestion and Preparation <font color = red>[25 marks]</font> <br>

#### **2.1** *Read Recipe Data from Dataframe and prepare the data for analysis* <font color = red>[12 marks]</font> <br>
Read the data from JSON file, print first five rows and describe the dataframe

##### **2.1.1** **Define a *load_json_dataframe* function** <font color = red>[7 marks]</font> <br>

Define a function that takes path of the ingredient_and_quantity.json file and reads it, convert it into dataframe - df and return it.

In [None]:
# define a function to load json file to a dataframe
def load_json_dataframe(path):
  json_data_df = pd.read_json(path)
  return json_data_df

##### **2.1.2** **Execute the *load_json_dataframe* function** <font color = red>[2 marks]</font> <br>

In [None]:
# read the json file by giving the file path and create a dataframe
data_path = './ingredient_and_quantity.json'
data_df = load_json_dataframe(data_path)

##### **2.1.3** **Describe the dataframe** <font color = red>[3 marks]</font> <br>

Print first five rows of dataframe along with dimensions. Display the information of dataframe

In [None]:
# display first five rows of the dataframe - df
data_df.head()

In [None]:
# print the dimensions of dataframe - df
print("Dimensions of the dataframe:", data_df.shape)

In [None]:
# print the information of the dataframe
data_df.info()

#### **2.2** *Recipe Data Manipulation* <font color = red>[13 marks]</font> <br>
Create derived metrics in dataframe and provide insights of the dataframe

##### **2.2.1** **Create input_tokens and pos_tokens columns by splitting the input and pos from the dataframe** <font color = red>[3 marks]</font> <br>
Split the input and pos into input_tokens and pos_tokens in the dataframe and display it in the dataframe

In [None]:
# split the input and pos into input_tokens and pos_tokens in the dataframe
# Tokenize input
data_df['input_tokens'] = data_df['input'].apply(lambda x: x.split())
# Tokenize POS
data_df['pos_tokens'] = data_df['pos'].apply(lambda x: x.split())

In [None]:
# display first five rows of the dataframe - df
data_df.head()

##### **2.2.2** **Provide the length for input_tokens and pos_tokens and validate their length** <font color = red>[2 marks]</font> <br>

Create input_length and pos_length columns in the dataframe and validate both the lengths. Check for the rows that are unequal in input and pos length


In [None]:
# create input_length and pos_length columns for the input_tokens and pos-tokens
data_df['input_length'] = data_df['input_tokens'].apply(len)
data_df['pos_length'] = data_df['pos_tokens'].apply(len)

In [None]:
# check for the equality of input_length and pos_length in the dataframe
data_df['is_length_equal'] = data_df['input_length'] == data_df['pos_length']
display(data_df['is_length_equal'].value_counts())

##### **2.2.3** **Define a unique_labels function and validate the labels in pos_tokens** <font color = red>[2 marks]</font> <br>

Define a unique_labels function which checks for all the unique pos labels in the recipe & execute it.


In [None]:
# Define a unique_labels function to checks for all the unique pos labels in the recipe & print it
def unique_labels(df):
  unique_labels = set()
  for pos_tokens in df['pos_tokens']:
    unique_labels.update(pos_tokens)
  return unique_labels

all_unique_labels = unique_labels(data_df)
print("Unique Labels:", all_unique_labels)

##### **2.2.3** **Provide the insights seen in the recipe data after validation** <font color = red>[1 marks]</font> <br>

Provide the indexes that requires cleaning and formatting in the dataframe

In [None]:
unequal_length_indexes = data_df[data_df['is_length_equal'] == False].index.tolist()
print("Indexes requiring cleaning:", unequal_length_indexes)

##### **2.2.4** **Drop the rows that have invalid data provided in previous cell** <font color = red> [2 marks]</font> <br>

In [None]:
# drop the irrelevant recipe data
data_df = data_df.drop(unequal_length_indexes)

# Display the dimensions of the dataframe after dropping the rows
print("Dimensions of the dataframe after dropping rows:", data_df.shape)

##### **2.2.5** **Update the input_length & pos_length in dataframe**<font color = red> [2 marks]</font> <br>

In [None]:
# update the input and pos length in input_length and pos_length
data_df['input_length'] = data_df['input_tokens'].apply(len)
data_df['pos_length'] = data_df['pos_tokens'].apply(len)

##### **2.2.6** **Validate the input_length and pos_length by checking unequal rows** <font color = red> [1 marks]</font> <br>

In [None]:
# validate the input length and pos length as input_length and pos_length
data_df['is_length_equal'] = data_df['input_length'] == data_df['pos_length']
data_df['is_length_equal'].value_counts()

## **3** Train Validation Split (70 train - 30 val) <font color = red>[6 marks]</font> <br>

#### **3.1** *Perform train and validation split ratio* <font color = red>[6 marks]</font> <br>
Split the dataset with the help of input_tokens and pos_tokens and make a ratio of 70:30 split for training and validation datasets.

###### **3.1.1** **Split the dataset into train_df and val_df into 70:30 ratio** <font color = red> [1 marks]</font> <br>

In [None]:
# split the dataset into training and validation sets
test_data_size = 0.3
train_df, val_df = train_test_split(data_df, test_size=test_data_size, random_state=42)

###### **3.1.2** **Print the first five rows of train_df and val_df** <font color = red> [1 marks]</font> <br>

In [None]:
# print the first five rows of train_df
train_df.head()

In [None]:
# print the first five rows of the val_df
val_df.head()

###### **3.1.3** **Extract the dataset into train_df and val_df into X_train, X_val, y_train and y_val and display their length** <font color = red> [2 marks]</font> <br>

Extract X_train, X_val, y_train and y_val by extracting the list of input_tokens and pos_tokens from train_df and val_df and also display their length

In [None]:
# extract the training and validation sets by taking input_tokens and pos_tokens
X_train = train_df['input_tokens']
y_train = train_df['pos_tokens']
X_val = val_df['input_tokens']
y_val = val_df['pos_tokens']

In [None]:
# validate the shape of training and validation samples
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of X_val:", X_val.shape)
print("Shape of y_val:", y_val.shape)

###### **3.1.4** **Display the number of unique labels present in y_train** <font color = red> [2 marks]</font> <br>

In [None]:
# Display the number of unique labels present in y_train
unique_labels = set()
for pos_tokens in y_train:
  unique_labels.update(pos_tokens)
print("Number of unique labels in y_train:", len(unique_labels))

## **4** Exploratory Recipe Data Analysis on Training Dataset <font color = red>[16 marks]</font> <br>

#### **4.1** *Flatten the lists for input_tokens & pos_tokens* <font color = red>[2 marks]</font> <br>

Define a function **flatten_list** for flattening the structure for input_tokens and pos_tokens. The input parameter passed to this function is a nested list.

Initialise the dataset_name with a value ***'Training'***




In [None]:
# flatten the list for nested_list (input_tokens, pos_tokens)
def flatten_list(nested_list):
    flattened = [item for sublist in nested_list for item in sublist]
    return flattened

In [None]:
# initialise the dataset_name
dataset_name = 'Training'

#### **4.2** *Extract and validate the tokens after using the flattening technique* <font color = red>[2 marks]</font> <br>

Define a function named ***extract_and_validate_tokens*** with parameters dataframe and dataset_name (Training/Validation), validate the length of input_tokens and pos_tokens from dataframe and display first 10 records for both the input_tokens and pos_tokens. Execute this function




In [None]:
# define a extract_and_validate_tokens with parameters (df, dataset_name)
# call the flatten_list and apply it on input_tokens and pos_tokens
# validate their length and display first 10 records having input and pos tokens
def extract_and_validate_tokens(df, dataset_name):
    input_tokens = flatten_list(df['input_tokens'])
    pos_tokens = flatten_list(df['pos_tokens'])
    if len(input_tokens) == len(pos_tokens):
        print(f"\nFirst 10 records in {dataset_name} dataset:")
        print("Input Tokens:")
        print(input_tokens[:10])
        print("\nPOS Tokens:")
        print(pos_tokens[:10])
        return input_tokens, pos_tokens
    else:
        print("Input and POS tokens have different lengths.")
        return None, None

In [None]:
# extract the tokens and its pos tags
train_ingredients, train_units = extract_and_validate_tokens(train_df, dataset_name)

#### **4.3** *Categorise tokens into labels (unit, ingredient, quantity)* <font color = red>[2 marks]</font> <br>

Define a function ***categorize_tokens*** to categorise tokens into ingredients, units and quantities by using extracted tokens in the previous code and return a list of ingredients, units and quantities. Execute this function to get the list.



In [None]:
# define a categorize_tokens function and provide the tokens and pos_tags as parameters and create ingredient, unit and quantity list and return it
# validate the list that it comprised of these labels, if not return empty arrays
def categorize_tokens(tokens, pos_tags):
    ingredients = []
    units = []
    quantities = []
    # Check if the lengths of tokens and pos_tags are equal before proceeding
    if len(tokens) != len(pos_tags):
      return [], [], []

    for token, pos in zip(tokens, pos_tags):
        if pos == 'ingredient':
            ingredients.append(token)
        elif pos == 'unit':
            units.append(token)
        elif pos == 'quantity':
            quantities.append(token)
    return ingredients, units, quantities


In [None]:
#  call the function to categorise the labels into respective list
train_ingredients_list, train_units_list, train_quantities_list = categorize_tokens(train_ingredients, train_units)

print("Length of ingredient list:", len(train_ingredients_list))
print("Length of unit list:", len(train_units_list))
print("Length of quantity list:", len(train_quantities_list))

#### **4.4** *Top 10 Most Frequent Items* <font color = red>[3 marks]</font> <br>

Define a function ***get_top_frequent_items*** to display top 10 most frequent items

Here, item_list is used as a general parameter where you will call this function for ingredient and unit list

Execute this function separately for top 10 most units and ingredients



In [None]:
# define a function get_top_frequent_items to get the top frequent items by using item_list, pos label and dataset_name(Training/Validation) and return top items
def get_top_frequent_items(item_list, pos_label, dataset_name):
    item_counts = Counter(item_list)
    top_items = item_counts.most_common(10)
    print(f"\nTop 10 Most Frequent {pos_label.capitalize()}s in {dataset_name} dataset:")
    for item, count in top_items:
        print(f"{item}: {count}")
    return top_items

In [None]:
# get the top ingredients which are frequently seen in the recipe
top_train_ingredients = get_top_frequent_items(train_ingredients_list, 'ingredient', dataset_name)

In [None]:
# get the top units which are frequently seen in the recipe
top_train_units = get_top_frequent_items(train_units_list, 'unit', dataset_name)

#### **4.5** *Plot Top 10 most frequent items* <font color = red>[2 marks]</font> <br>




Define a function ***plot_top_items*** to plot a bar graph on top 10 most frequent items for units and ingredients

Here, item_list is used as a general parameter where you will call this function for ingredient and unit list

In [None]:
# define plot top items with parameters - top_item list, label to suggest whether its ingredient or unit, dataset_name
def plot_top_items(top_items, label, dataset_name):
    items, counts = zip(*top_items)
    plt.figure(figsize=(10, 6))
    sns.barplot(x=list(items), y=list(counts), palette='viridis')
    plt.title(f'Top 10 Most Frequent {label.capitalize()}s in {dataset_name} Dataset')
    plt.xlabel(label.capitalize())
    plt.ylabel('Frequency')
    plt.show()

#### **4.6** *Perform EDA analysis* <font color = red>[5 marks]</font> <br>

Plot the bar plots for ingredients and units and provide the insights for training dataset

---



In [None]:
# plot the top frequent ingredients in training data
plot_top_items(top_train_ingredients,'ingredient',dataset_name)

In [None]:
# plot the top frequent units in training data
plot_top_items(top_train_units,'unit',dataset_name)

## **5** Exploratory Recipe Data Analysis on Validation Dataset (Optional)<font color = red> [0 marks]</font> <br>

#### **5.1** *Execute EDA on Validation Dataset with insights (Optional)* <font color = red> [0 marks]</font> <br>
Initialise the dataset_name as ***Validation*** and call the ***plot_top_items*** for top 10 ingredients and units in the recipe data
Provide the insights for the same.



In [None]:
# initialise the dataset_name
val_dataset_name = 'Validation'

In [None]:
# use extract and validate tokens, categorise tokens, get top frequent items for ingredient list and unit list on validation dataframe
val_ingredients, val_units = extract_and_validate_tokens(val_df, val_dataset_name)

In [None]:
val_ingredients_list, val_units_list, val_quantities_list = categorize_tokens(val_ingredients, val_units)
top_val_ingredients = get_top_frequent_items(val_ingredients_list, 'ingredients', val_dataset_name)
top_val_units = get_top_frequent_items(val_units_list, 'unit', val_dataset_name)

In [None]:
# plot the top frequent ingredients in validation data
plot_top_items(top_val_ingredients,'ingredient',val_dataset_name)

In [None]:
# plot the top frequent units in training data
plot_top_items(top_val_units,'unit',val_dataset_name)

## **6** Feature Extraction For CRF Model <font color = red>[30 marks]</font> <br>

### **6.1** *Define a feature functions to take each token from recipe* <font color = red>[10 marks]</font>

Define a function as ***word2features*** which takes a particular recipe and its index to work with all recipe input tokens and include custom key-value pairs.

Also, use feature key-value pairs to mark the beginning and end of the sequence and to also check whether the word belongs to unit, quantity etc. Use keyword sets for unit and quantity for differentiating feature functions well. Also make use of relevant regex patterns on fractions, whole numbers etc.

##### **6.1.1** **Define keywords for unit and quantity and create a quantity pattern to work on fractions, numbers and decimals** <font color = red>[3 marks]</font> <br>

Create sets for **unit_keywords** and ***quantity_keywords*** and include all the words relevant for measuring the ingredients such as cup, tbsp, tsp etc. and in quantity keywords, include words such as half, quarter etc.

Also suggested to use regex pattern as ***quantity_pattern*** to work with quantity in any format such as fractions, numbers and decimals.

Then, load the spacy model and process the entire sentence

In [None]:
# define unit and quantity keywords along with quantity pattern
unit_keywords = train_units_list
quantity_keywords = train_quantities_list
# matches integers, decimals, and fractions
quantity_pattern = re.compile(r'^\d+(\.\d+)?(/[1-9]\d*)?$')

In [None]:
# load spaCy model
model = spacy.load("en_core_web_sm")

##### **6.1.2** **Define feature functions for CRF** <font color = red>[7 marks]</font> <br>

Define ***word2features*** function and use the parameters such as sentence and its indexing as ***sent*** and ***i*** for extracting token level features for CRF Training.
Build ***features*** dictionary, also mark the beginning and end of the sequence and use the ***unit_keywords***, ***quantity_keywords*** and ***quantity_pattern*** for knowing the presence of quantity or unit in the tokens

While building ***features*** dictionary, include
- ***Core Features*** - The core features of a token should capture its lexical
and grammatical properties. Include attributes like the raw token, its lemma, part-of-speech tag, dependency relation, and shape, as well as indicators for whether it's a stop word, digit, or punctuation. The details of the features are given below:

    - `bias` - Constant feature with a fixed value of 1.0 to aid model learning.
    - `token` - The lowercase form of the current token.
    - `lemma` - The lowercase lemma (base form) of the token.
    - `pos_tag` - Part-of-speech (POS) tag of the token.
    - `tag` - Detailed POS tag of the token.
    - `dep` - Dependency relation of the token in the sentence.
    - `shape` - Shape of the token (e.g., "Xxx" for "Milk").
    - `is_stop` - Boolean indicating if the token is a stopword.
    - `is_digit` - Boolean indicating if the token consists of only digits.
    - `has_digit` - Boolean indicating if the token contains at least one digit.
    - `has_alpha` - Boolean indicating if the token contains at least one alphabetic character.
    - `hyphenated` - Boolean indicating if the token contains a hyphen (-).
    - `slash_present` - Boolean indicating if the token contains a slash (/).
    - `is_title` - Boolean indicating if the token starts with an uppercase letter.
    - `is_upper` - Boolean indicating if the token is fully uppercase.
    - `is_punct` - Boolean indicating if the token is a punctuation mark.

- ***Improved Quantity and Unit Detection*** - Use key-value pairs to mark the presence of quantities and units in the features dictionary. Utilise the unit_keywords, quantity_keywords, and quantity_pattern to identify and flag these elements. The details of the features are given below:

    - `is_quantity` - Boolean indicating if the token matches a quantity pattern or keyword.
    - `is_unit` - Boolean indicating if the token is a known measurement unit.
    - `is_numeric` - Boolean indicating if the token matches a numeric pattern.
    - `is_fraction` - Boolean indicating if the token represents a fraction (e.g., 1/2).
    - `is_decimal` - Boolean indicating if the token represents a decimal number (e.g., 3.14).
    - `preceding_word` - The previous token in the sentence, if available.
    - `following_word` - The next token in the sentence, if available.

- ***Contextual Features*** - Incorporate contextual information by adding features for the preceding and following tokens. Include indicators like BOS and EOS to mark the beginning and end of the sequence, and utilise unit_keywords, quantity_keywords, and quantity_pattern to identify the types of neighboring tokens. The features are given below:

    - `prev_token` - The lowercase form of the previous token.
    - `prev_is_quantity` - Boolean indicating if the previous token is a quantity.
    - `prev_is_digit` - Boolean indicating if the previous token is a digit.
    - `BOS` - Boolean indicating if the token is at the beginning of the sentence.
    - `next_token` - The lowercase form of the next token.
    - `next_is_unit` - Boolean indicating if the next token is a unit.
    - `next_is_ingredient` - Boolean indicating if the next token is not a unit or quantity.
    - `EOS` - Boolean indicating if the token is at the end of the sentence.



In [None]:
# define word2features for processing each token in the sentence sent by using index i.
# use your own feature functions
def word2features(sent, pos):
    word = sent[pos]
    features = {
        # --- Core Features ---
        'bias': 1.0,
        'token': word.lower(),
        'lemma': model(word)[0].lemma_.lower(),
        'pos_tag': model(word)[0].pos_,
        'tag': model(word)[0].tag_,
        'dep': model(word)[0].dep_,
        'shape': re.sub(r'[0-9]', '0', re.sub(r'[A-Z]', 'X', re.sub(r'[a-z]', 'x', word))),
        'is_stop': model(word)[0].is_stop,
        'is_digit': word.isdigit(),
        'has_digit': any(char.isdigit() for char in word),
        'has_alpha': any(char.isalpha() for char in word),
        'hyphenated': '-' in word,
        'slash_present': '/' in word,
        'is_title': word.istitle(),
        'is_upper': word.isupper(),
        'is_punct': model(word)[0].is_punct,

        # --- Improved Quantity & Unit Detection ---
        'is_quantity': word.lower() in quantity_keywords or (quantity_pattern.match(word) is not None), # Ensure match returns boolean
        'is_unit': word.lower() in unit_keywords,
        'is_numeric': word.replace('.', '', 1).isdigit(),
        'is_fraction': '/' in word and (quantity_pattern.match(word) is not None), # Ensure match returns boolean
        'is_decimal': '.' in word and word.replace('.', '', 1).isdigit(),
        'preceding_word': sent[pos - 1] if pos > 0 else '',
        'following_word': sent[pos + 1] if pos < len(sent) - 1 else '',

        # --- Contextual Features ---
        'prev_token': sent[pos - 1].lower() if pos > 0 else '',
        'prev_is_quantity': (sent[pos - 1].lower() in quantity_keywords or (quantity_pattern.match(sent[pos - 1]) is not None)) if pos > 0 else False, # Ensure boolean
        'prev_is_digit': sent[pos - 1].isdigit() if pos > 0 else False, # Ensure boolean
        'BOS': pos == 0,
        'next_token': sent[pos + 1].lower() if pos < len(sent) - 1 else '',
        'next_is_unit': (sent[pos + 1].lower() in unit_keywords if pos < len(sent) - 1 else False), # Ensure boolean
        'next_is_ingredient': ((sent[pos + 1].lower() not in unit_keywords and (sent[pos + 1].lower() not in quantity_keywords and (quantity_pattern.match(sent[pos + 1]) is not None))) if pos < len(sent) - 1 else False), # Ensure boolean
        'EOS': pos == len(sent) - 1
    }

    return features

### **6.2** *Preparation of Recipe level features* <font color = red>[2 marks]</font>


##### **6.2.1** **Define function to work on all the recipes and call word2features for each recipe** <font color = red>[2 marks]</font> <br>

Define ***sent2features*** function and inputs ***sent*** as a parameter and correctly generate feature functions for each token present in the sentence

In [None]:
# define sent2features by working on each token in the sentence and correctly generate dictionaries for features
def sent2features(sent):
  return [word2features(sent, i) for i in range(len(sent))]

### **6.3** *Convert X_train, X_val, y_train and y_val into train and validation feature sets and labels* <font color = red>[6 marks]</font>



##### **6.3.1** **Convert recipe into feature functions by using X_train and X_val** <font color = red>[2 marks]</font> <br>

Create ***X_train_features*** and ***X_val_features*** as list to include the feature functions for each recipe present in training and validation sets

In [None]:
# Convert input sentences into feature sets by taking training and validation dataset as X_train_features and X_val_features
X_train_features = [sent2features(sent) for sent in X_train]
X_val_features = [sent2features(sent) for sent in X_val]

##### **6.3.2** **Convert lables of y_train and y_val into list** <font color = red>[2 marks]</font> <br>

Create ***y_train_labels*** and ***y_val_labels*** by using the list of y_train and y_val

In [None]:
# Convert labels into list as y_train_labels and y_val_labels
y_train_labels = y_train
y_val_labels = y_val

##### **6.3.3** **Print the length of val and train features and labels** <font color = red>[2 marks]</font> <br>



In [None]:
# print the length of train features and labels
print("Length of X_train_features:", len(X_train_features))
print("Length of y_train_labels:", len(y_train_labels))

In [None]:
# print the length of validation features and labels
print("Length of X_val_features:", len(X_val_features))
print("Length of y_val_labels:", len(y_val_labels))

### **6.4** *Applying weights to feature sets* <font color = red>[12 marks]</font> <br>




##### **6.4.1** **Flatten the labels of y_train** <font color = red>[2 marks]</font> <br>

Create ***y_train_flat*** to flatten the structure of nested y_train

In [None]:
# Flatten labels in y_train
y_train_flat = flatten_list(y_train_labels)

##### **6.4.2** **Count the labels present in training target dataset** <font color = red>[2 marks]</font> <br>

Create ***label_counts*** to count the frequencies of labels present in y_train_flat and retrieve the total samples by using the values of label_counts as ***total_samples***

In [None]:
# Count label frequencies as label_counts and total_samples as getting the summation of values of label_counts
label_counts = Counter(y_train_flat)
total_samples = sum(label_counts.values())

##### **6.4.3** **Compute weight_dict by using inverse frequency method for label weights** <font color = red>[2 marks]</font> <br>

- Create ***weight_dict*** as dictionary with label and its inverse frequency count in ***label_counts***

- Penalise ingredient label in the dictionary

In [None]:
# Compute class weights (inverse frequency method) by considering total_samples and label_counts
weight_dict = {label: total_samples / count for label, count in label_counts.items()}

In [None]:
# penalise ingredient label
weight_dict['ingredient'] = 1.0

##### **6.4.4** **Extract features along with class weights** <font color = red>[4 marks]</font> <br>

Define a function ***extract_features_with_class_weights*** to work with training and validation datasets and extract features by applying class weights





In [None]:
# Apply weights to feature extraction in extract_features_with_class_weights by using parameters such as X (input tokens), y(labels) and weight_dict (Class weights)
def extract_features_with_class_weights(X_features, y_labels, weight_dict):
    X_weighted_features = []
    y_weighted_labels = [] # Keep track of labels as well, although they aren't modified

    for sentence_features, labels in zip(X_features, y_labels):
        weighted_sentence_features = []
        for i in range(len(sentence_features)):
            features = sentence_features[i].copy() # Create a copy to avoid modifying the original features
            label = labels[i]
            features['class_weight'] = weight_dict.get(label, 1.0) # Add class weight feature
            weighted_sentence_features.append(features)
        X_weighted_features.append(weighted_sentence_features)
        y_weighted_labels.append(labels)

    return X_weighted_features, y_weighted_labels

##### **6.4.5** **Execute extract_features_with_class_weights on training and validation datasets** <font color = red>[2 marks]</font> <br>

Create ***X_train_weighted_features*** and ***X_val_weighted_features*** for extracting training and validation features along with their weights by calling ***extract_features_with_class_weights*** on the datasets

In [None]:
# Apply manually computed class weights
X_train_weighted_features, y_train_weighted_labels = extract_features_with_class_weights(X_train_features, y_train_labels, weight_dict)
X_val_weighted_features, y_val_weighted_labels = extract_features_with_class_weights(X_val_features, y_val_labels, weight_dict)

## **7** Model Building and Training <font color = red>[10 marks]</font> <br>

### **7.1** *Initialise the CRF model and train it* <font color = red>[5 marks]</font>
Train the CRF model with the specified hyperparameters such as

### CRF Model Hyperparameters Explanation

| Parameter                  | Description |
|----------------------------|-------------|
| **algorithm='lbfgs'**      | Optimisation algorithm used for training. `lbfgs` (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) is a quasi-Newton optimisation method. |
| **c1=0.5**                | L1 regularisation term to control sparsity in feature weights. Helps in feature selection. |
| **c2=1.0**                | L2 regularisation term to prevent overfitting by penalising large weights. |
| **max_iterations=100**     | Maximum number of iterations for model training. Higher values allow more convergence but increase computation time. |
| **all_possible_transitions=True** | Ensures that all possible state transitions are considered in training, making the model more robust. |

Use weight_dict for training CRF



In [None]:
# initialise CRF model with the specified hyperparameters and use weight_dict
crf = sklearn_crfsuite.CRF(algorithm='lbfgs',
                           c1=0.5,
                           c2=1.0,
                           max_iterations=100,
                           all_possible_transitions=True)
# train the CRF model with the weighted training data
crf.fit(X_train_weighted_features, y_train_weighted_labels)

### **7.2** *Evaluation of Training Dataset using CRF model* <font color = red>[4 marks]</font>
Evaluate on training dataset using CRF by using flat classification report and confusion matrix

In [None]:
# evaluate on the training dataset
y_pred_train = crf.predict(X_train_weighted_features)

In [None]:
# specify the flat classification report by using training data for evaluation
print(metrics.flat_classification_report(y_train_weighted_labels, y_pred_train))

In [None]:
# Calculate the f1 score using the train data
metrics.flat_f1_score(y_train_weighted_labels, y_pred_train, average='weighted')

In [None]:
# create a confusion matrix on training datset
print(confusion_matrix(y_train_weighted_labels, y_pred_train))

### **7.3** *Save the CRF model* <font color = red>[1 marks]</font>
Save the CRF model

In [None]:
# dump the model using joblib as crf_model.pkl
joblib.dump(crf, 'crf_model.pkl')

## **8** Prediction and Model Evaluation <font color = red>[3 marks]</font> <br>

### **8.1** *Predict and Evaluate the CRF model on validation set* <font color = red>[3 marks]</font>
Evaluate the metrics for CRF model by using flat classification report and confusion matrix




In [None]:
# predict the crf model on validation dataset
y_pred_val = crf.predict(X_val_weighted_features)

In [None]:
# specify flat classification report
print(metrics.flat_classification_report(y_val_weighted_labels, y_pred_val))

In [None]:
# Calculate the f1 score using the test data
metrics.flat_f1_score(y_val_weighted_labels, y_pred_val, average='weighted')

In [None]:
# create a confusion matrix on validation dataset
print(confusion_matrix(y_val_weighted_labels, y_pred_val))

## **9** Error Analysis on Validation Data <font color = red>[10 marks]</font> <br>
Investigate misclassified samples in validation dataset and provide the insights


### **9.1** *Investigate misclassified samples in validation dataset* <font color = red>[8 marks]</font>



##### **9.1.1** Flatten the labels of validation data and initialise error data <font color = red>[2 marks]</font> <br>



Flatten the true and predicted labels and initialise the error data as ***error_data***

In [None]:
# flatten Labels and Initialise Error Data
y_val_flat = flatten_list(y_val_weighted_labels)
y_pred_val_flat = flatten_list(y_pred_val)
X_val_weighted_features_flat = flatten_list(X_val_weighted_features)

##### **9.1.2** Iterate the validation data and collect Error Information<font color = red> [2 marks]</font> <br>



Iterate through validation data (X_val, y_val_labels, y_pred_val) and compare true vs. predicted labels. Collect error details, including surrounding context, previous/next tokens, and class weights, then store them in error_data

In [None]:
# iterate and collect Error Information
# get previous and next tokens with handling for boundary cases
error_data = []
for i in range(len(y_val_flat)):
    if y_val_flat[i] != y_pred_val_flat[i]:
      error_data.append({
          'token': X_val_weighted_features_flat[i]['token'],
          'previous_token': X_val_weighted_features_flat[i - 1]['token'] if i > 0 else '',
          'next_token': X_val_weighted_features_flat[i + 1]['token'] if i < len(X_val) - 1 else '',
          'true_label': y_val_flat[i],
          'predicted_label': y_pred_val_flat[i],
          'context': f"{X_val_weighted_features_flat[i - 1]['token']} {X_val_weighted_features_flat[i]['token']} {X_val_weighted_features_flat[i + 1]['token']}"
      })


##### **9.1.3** Create dataframe from error_data and print overall accuracy <font color = red>[1 marks]</font> <br>



Change error_data into dataframe and then use it to illustrate the overall accuracy of validation data

In [None]:
# Create DataFrame and Print Overall Accuracy
error_df = pd.DataFrame(error_data)
error_df

##### **9.1.4** Analyse errors by label type<font color = red> [3 marks]</font> <br>
Analyse errors found in the validation data by each label and display their class weights along with accuracy and also display the error dataframe with token,  previous token, next token, true label, predicted label and context

In [102]:
# Analyse errors found in the validation data by each label
# and display their class weights along with accuracy
# and display the error dataframe with token, previous token, next token, true label, predicted label and context

# Group errors by true label
errors_by_label = error_df.groupby('true_label').size().reset_index(name='error_count')

# Calculate total instances for each label in the validation set
val_label_counts = Counter(y_val_flat)
total_val_samples = sum(val_label_counts.values())

# Calculate accuracy per label
label_accuracy = {label: 1 - (errors_by_label[errors_by_label['true_label'] == label]['error_count'].sum() / val_label_counts[label]) if label in val_label_counts and val_label_counts[label] > 0 else 0 for label in all_unique_labels}

# Display class weights and accuracy per label
print("Class Weights and Accuracy per Label:")
for label in all_unique_labels:
    weight = weight_dict.get(label, 1.0)
    accuracy = label_accuracy.get(label, 0)
    print(f"Label: {label}, Class Weight: {weight:.2f}, Accuracy: {accuracy:.2f}")

# Display the error dataframe with relevant columns
display(Markdown("### Error Analysis DataFrame"))
display(error_df[['token', 'previous_token', 'next_token', 'true_label', 'predicted_label', 'context']])

Class Weights and Accuracy per Label:
Label: ingredient, Class Weight: 1.00, Accuracy: 1.00
Label: quantity, Class Weight: 7.26, Accuracy: 0.99
Label: unit, Class Weight: 8.77, Accuracy: 0.99


### Error Analysis DataFrame

Unnamed: 0,token,previous_token,next_token,true_label,predicted_label,context
0,is,pur,,quantity,unit,pur is 2
1,for,oil,,quantity,unit,oil for kneading
2,to,10,,unit,quantity,10 to 12
3,a,haldi,,unit,quantity,haldi a pinch
4,pinch,dal,,quantity,unit,dal pinch asafoetida
5,cloves,tomatoes,,quantity,unit,tomatoes cloves garlic


\### **9.2** *Provide insights from the validation dataset* <font color = red>[2 marks]</font>




Only six tokens were inaccurately predicted; the remaining predictions appear correct.

## **10** Conclusion (Optional) <font color = red>[0 marks]</font> <br>

Write your findings and conclusion.