In [1]:
from tokeniser import Tokeniser

In [2]:
text = '''
  Ent⨳ity NormalisationIn Indic Languages










  Tasmay Pankaj Tibrewal
  1. Introduction
  Entity normalization is central to many NLP tasks. In Indic languages, the challenge amplifies because we must handle multiple scripts (Devanagari, Tamil, Telugu, etc.), plus localized words for months, currency, numeric expansions, etc. Our end goal is to take sentences containing dates, currencies, and scientific units and produce fully spelled-out text in the same language script. This single problem touches on multilingual NER, text normalization, script detection, numeric expansions, and more.
  In this project, we explored three broad strategies:
  Agentic (Prompt-based)
  Algorithmic (Manual rule-based)
  Fine-Tuned LLM (Supervised approach on a synthetic dataset)
  We also produced a dataset of roughly 1,600 synthetic examples spanning 10 major Indian languages and multiple domains. Below is an in-depth account of every step.








  2. Data Curation in Extreme Detail
  Because good data underpins any approach to entity normalization, we begin by dissecting the data generation and splitting procedures meticulously.
  2.1 Motivations and Requirements
  Why Synthetic Data?
  Rarity of Real Datasets: We needed examples that specifically showcased transformations from numeric or symbolic forms to spelled-out forms in each script. 
  Controlled Diversity: By instructing a generative model to produce sentences with multiple entity forms (dates, currencies, units) in each language, we can systematically cover a wide range of scenarios.
  Domain Variation: We wanted to ensure coverage in “medical,” “news,” “financial,” “scientific,” “legal,” etc. contexts, something that real data might not guarantee without extensive curation.
  Language Coverage
  We decided to cover 10 languages: Hindi, Tamil, Telugu, Kannada, Malayalam, Odia, Bengali, Gujarati, Punjabi, and Marathi. Each has unique scripts and morphological nuances. For example:
  Hindi & Marathi share Devanagari but differ in certain vocabulary.
  Gujarati and Bengali have distinct scripts but might share some conceptual expansions for numbers.
  Domain Coverage
  Data is generated from different domains, ensuring that the output is varied and diverse. These are:
  Scientific
  Medical
  Financial
  Literature
  General
  Technical
  Academic
  News
  Legal
  Geography


  Entity Diversity
  Dates: We needed multiple patterns—DD/MM/YYYY,MM/DD/YYYY, DD-MM-YY, 2nd Jan, ‘March 15, 1990’, etc. This ensures the final model or algorithm can adapt to different real-world date notations.
  Currencies: $120, ₹500, INR 700, Rs. 250, €300, ¥1500, etc. The top 10 global currencies, each possibly spelled out differently in each language script.
  Scientific Units: From simple (10kg) to compound (10 km/h), plus temperature (20°C), volume (2 litre), weight (lbs, tonne), and more.
  2.2 Prompt Engineering for Data Generation
  The data was synthetically created using Google AI Studio (Gemini), with carefully structured prompts. A typical master prompt looked something like:
  The prompts can be found in the data generation prompts tab. An iterative approach was used to create the data in the chat. Model used for synthetic data generation is `gemini-2.0-flash-thinking-01-21` as it had long context support, is free of cost and has a huuge output window (~65K tokens).
  This iterative approach let us produce ~1,600 examples, ensuring distribution across:
  Domains: 10 domains in total.
  Languages: 10 languages in total.
  Temperature: 0.1, 0.4, 0.7, 1.0 (Some examples are short and direct, while others are long or more creative).
  2.3 Data Inspection & Quality Checks
  Format Validation: Each synthetic record was verified to ensure it had the keys ["sl. no.", "language", "input", "output", "domain"]. We also introduced a gen_temperature field to track which temperature setting originated in the sample.


  Edge Cases:
  Some lines had no entity. (E.g., “यह एक सामान्य वाक्य है।” in Hindi, with no numeric content.)
  Some lines had multiple (3–4) entities.


  Hallucinations: Occasionally, Gemini would produce incomplete JSON or truncated text. We filtered or corrected these by re-prompting or manually discarding them if they were too broken.


  Redundancy: Redundancy was low, it was ensured that the whole model had the context of the entire chat. Certain words were found to be repeated in a few sentences; however, either the application of words was in a different context, or there was enough diversity between the sentences. All sentences could not be manually verified, but a sample portion was randomly checked (about 100).
  2.4 Splitting into Train & Eval
  2.4.1 State Definition
  Each sample had a (language, domain, gen_temperature) tuple. We call this a “state”. We aimed for each state to appear in both train and eval in a 3:1 ratio. However, we also had exceptions:
  If a state had only 1 sample, we randomly assigned it with 50% probability to train or eval.
  If a state had 2 or more samples, we tried a 3:1 ratio. But if that left a state with 0 in eval, we forced at least one sample into eval.
  Hence, we ended up with ~1,185 train samples and ~415 eval samples.
  Note: A potential shortcoming of the data could be the lack of multiple elements present in each state. Thus requiring a bit more diversity, class balancing and a larger dataset.
  2.4.2 Statistical Distribution
  We carefully examined these to ensure no major domain-language cluster was starved. The distribution of the total data can be found in the data distribution tab.
  2.5 Final Dataset & Access
  Size: 1,600 total.
  Train: ~1,185.
  Eval: ~415.
  Format: Provided as CSV with columns sl_no, language, input, output, domain, gen_temperature. Also shared on Hugging Face at Tasmay-Tib/sarvam-entity-recognition-gemini-2.0-flash-thinking-01-21-distill-1600.
  Note: Being synthetic, it might not perfectly reflect real usage in morphological or domain-specific complexities. However, it’s still valuable for a first pass at the normalization task.



  2.6 Issues Found in the Dataset (Impacting Model Performance)
  Limited Very Long Sentences
  The dataset mostly contains short to medium-length sentences (under ~100 tokens). Consequently, extremely long sentences (150–200+ tokens) are underrepresented.


  Restricted Decimal Number Variety
  Although decimal numbers appear in the dataset, they are not comprehensively represented (e.g., 2.5 but rarely 2.75, 3.14159, etc.). This relative sparseness leads to the model mishandling more complex decimal expansions.


  Rare Date Formats
  Formats like “2 taarik March 2016 ko…” are infrequent. Most examples stick to more standardized forms (DD/MM/YYYY, DD-MM-YYYY, textual month names in scripts, etc.). Hence, the model might fail to parse or transform dates expressed in colloquial or semi-transliterated styles.


  Complex or Uncommon Unit Handling
  Rare or domain-specific units (e.g., mmHg, mEq/L) are not well-represented. The dataset focuses on more common units (kg, mg, km/h, etc.), so the model may hallucinate or omit expansions for those complex, less frequent units.


  Insufficient Numeric Range
  Synthetic examples typically use smaller or moderately sized numbers. Very large numbers or close numeric values (e.g., 74 vs. 75) appear only occasionally. This can lead the model to confuse near-similar values, revealing a gap in numeric variety within the training data.


  Insufficient number of examples
  Although the dataset was relatively diverse, still a larger dataset covering all different sentence types, like uncommon date types, multiple examples for each state. More diversity per language, larger sentences, sentences with decimals and larger numeric ranges, etc. Would have provided a good base for training. Something like ~10K queries would push the model to its maximum.






  3. Three Approaches to Entity Normalization
  We tried three approaches to map input -> normalized output for the same dataset. Below, each is explained in detail.
  3.1 Agentic (Prompt-Based) Approach
  3.1.1 Conceptual Overview
  This method uses a single large language model, e.g., unsloth/Meta-Llama-3.1-8B, sarvamai/sarvam-1, and Qwen/Qwen2.5-3B. We set up a chat-like scenario:
  We prompted the model in a non-instruct manner since we were using the base model versions. We created a movie-script-like prompt. Containing the back-and-forth between the two main characters. This included:
  The main character describing the task
  The main character giving an example
  The helper character giving a feedback of what they understand
  The main character asking to transform a sentence
  The helper character giving a response of the sentence
  This cycle getting repeated (multi-shot prompting)
  We take input the sentence
  We analyze the language of the sentence (using an algorithmic language detector - in algorithmic approach) and use the language’s specific prompt to feed the sentence for transformation.
  The model attempts to produce an immediate, direct transformation.
  Note: The language detector was used to use the stored pre-translated prompt in the same language of the input text, since the model was facing a lot of issues with translation if the prompt included multiple languages (i.e. instruction in one language, examples in other and final sentence in a third one). Also, since this method was only experimental and performed poorly, only Hindi examples were tested (since we had only crafted the Hindi prompt). Still, if the prompt is translated to other languages and stored, the system can also be tested on other language samples. Hindi was chosen just for testing and demonstration purposes, and since this method was not followed, other language samples were not added.
  3.1.2 Limitations & Observations
  Off-Topic or Extra Text:
  The model sometimes appended partial dialogues or new sentences, where it continued the text of the script.
  Or it repeated the user’s instructions instead of producing a direct final result.

  Poor Numeric Accuracy:
  E.g., ¥120 → 120 dollars or messing up the spelled-out date for certain languages.
  No specialized training was done, so it’s guesswork.


  Inconsistent Across Languages:
  In Tamil or Odia, the model might default to transliteration or partial expansions in English.
  Variation in performance was high from one prompt to another.
  3.1.3 Model performance and agentic use
  Llama performed quite well compared to Qwen and Sarvam. For the others, it often converted the entities to words in English and often forgot to convert numbers to words. Thus, an iterative preset set of prompts was stored, which incorporated the model’s new responses and at each step, into updated prompts based on the movie-script structure, instructing on the removal of English words and conversion of numbers to words. This improved Qwen and Sarvam's performance relatively well.

  All the prompts for this section can be found in the Agentic prompt tab.
  Conclusion: The agentic approach is the easiest to set up but gave the weakest results in systematically normalizing multi-lingual numeric data. It’s a fallback method if one can’t or won’t do fine-tuning or code a rule-based pipeline, but it is not recommended for serious usage.
  3.2 Algorithmic (Rule-Based) Approach 
  3.2.1 High-Level Flow
  Script & Language Detection
  We first identify which Indic language (Hindi, Tamil, Telugu, etc.) we’re dealing with, typically via script range checks or secondary heuristics.


  Date conversion
  Dates are recognised by pattern matching in the sentence
  Then the forms of these dates are recognised, and a placeholder is attached indicating that it is a date token, to be converted to words in a different manner (later)


  Regex & Pattern Searches
  We define layered, specialized regexes for dates, currencies, units, and decimal/whole numbers.
  Each date match is replaced or “tagged” with a placeholder (e.g. :$date, etc.) while we store the structured parse data in some dictionary.


  Language-Specific Spelled-Out Expansions
  For each placeholder, we look up the expansions in the appropriate script: e.x. for Hindi: "$" -> "डॉलर", mg -> "मिलीग्राम", etc.
  For the numbers, we modify a num2words function from a pre-built library to get the word for the number after the conversion process.


  Re-Assembly
  After all placeholders are recognized, we replace them one by one in the text. We ensure that spacing, punctuation, or original text ordering is preserved.


  Output
  We finalize the text with all expansions, checking for leftover unrecognized patterns or possible partial collisions.
  Below, we break down each stage in more granular detail.
  3.2.2 Language & Script Identification
  Most of the algorithm’s expansions are language/script-dependent—for instance, “dollar” in Hindi is “डॉलर,” but in Tamil it’s “டாலர்.” Therefore, we must detect which language we’re dealing with.
  Script Range
  For each character, check if it falls in a known block (e.g. Devanagari: U+0900–U+097F). If 80%+ of the text’s characters are in the Devanagari range, we guess it’s Hindi or Marathi.
  Gurmukhi block: likely Punjabi, etc.
  A script confidence score is calculated based on total words found in script X vs script Y.


  Disambiguation (if multiple languages share the same script, e.g. Hindi vs. Marathi)
  Marathi tends to use a given set of characters or alphabets (ex: े or virama/halant) 
  Similarly Hindi has some common set of alphabets (ex: ा or nasalization marks)
  Similarly, vowel marks have different frequencies in each language.


  Probability Calculation 
  Language’s script-based probabilities are directly calculated based on script distribution and language-specific properties 
  Each script is previously mapped to a language; thus, the language’s score is directly affected.
  If the number of languages corresponding to the script is one, then a direct score is directly given for the percentage of the characters of the script out of total chars
  For Hindi and marathi, we start with a base score of 50 and a max adjustment per feature score of 50 (we check for 3 features, two features favoring each language (one is common for both), and for the occurrence of each feature, we find its percentage occurrence, and adjust the score count accordingly
  Maximum score adjustment could be ~100 (for complete confidence in two favoring and zero confidence in opposing features, which could take up the maximum language score to be 150). But practically, that is not likely because of consonants and other language-based characters present in the sentence, taking the individual scores down (since they are based on character probability) or increasing the opposing scores.
  Still, a maximum threshold of 90 and a minimum threshold of 10 is kept to keep the probabilities from getting skewed.


  Further Robustness and Disambiguation
  Despite language-based features, we look for a more robust method of language detection
  This includes n-gram probability extraction, common word markers count per language, and dictionary-based stop-word matching.
  Dictionary-based matching
  A dictionary of each language is created based on content available on GitHub, in NLP libraries, or it is just generated using Gemini (for at least 100-250 most common words)
  Based on that, simple word matching is performed and checked for the presence of dictionary stop words in the sentence, and a score is calculated based on the percentage of common words in each language.
  After the stopwords data was created, it was stored in .txt files with name format (language_stop_words.txt), and then all of this was stored in a zip called 'stop_words_archive.zip'.


  Common marker-based matching
  We can look for common words or suffixes (“है”, “की”, “में” for Hindi vs. “आहे”, “असतो” in Marathi, “நான்”, “நீ”, “அவள்” in Tamil, etc.).
  These are different from dictionary words, since they are looking at which words were present from which language and trying to get a score on the number of unique words per language / total words based on 100 or 200 most common words.
  This is just using the markers, which are to be commonly found across, and trying to measure an occurrence score (how frequently markers occur) per language after dividing it by total marker frequency.


  Ngram based matching
  Ngram is the concept of using n-length character combinations and creating a probabilistic match on top of it.
  This requires the total unique set of common markers and unique stop words (from the dictionary) that are joined together with spaces.
  Then, the spaces are replaced by (n-1) ‘_’ signs. For example: ‘Hi I am Tasmay’ for n=3 is ‘Hi__I__am__Tasmay’ where each space was replaced with two consecutive underscores ‘__’. 
  This now creates a text on which we can iterate over a window of length, and for each unique window, we can tally the counts.
  After we have the total window counts and the individual count of the unique windows, we have built an n-gram model.
  Now, we proceed to do the same thing with the input text and match with the available n-grams. If a match is present, we tally the log probability of the match in our score. Else, we add a log of 1e-20 (a very small number instead of 0 to smoothen the probability)
  We divide the score by the total number of n-grams found in the language to get an average score for each language (average score instead of total score to measure the relative quality of each language and not just the quantity part. It may also be that the dictionary/marker size of a language was larger, thus creating a higher absolute match of the n-grams for the language.
  Then, we normalize the scores between 0 and 1 by the maximum language score by an exponential normalization technique (escore-max_score).
  After this, a final probability score is calculated by dividing language scores by the total scores.
  We also experimented with various other smoothening techniques because of the low probabilities and the difficult-to-handle nature of n-gram matches (but eventually settled with this)
  For the value of n, we started with n=3 and experimented with n=2 and n=4 and found n=4 to be the best performer. Higher values were not considered since they have quite a high probability of shooting into next words (which may make the use case not ideal).
  Note: If we guess incorrectly, expansions might reference the wrong dictionary (e.g., “डॉलर” vs. “डालर”), but typically the matching system was made distinct enough for recognizing major languages accurately (Tamil, Telugu, Kannada, etc.).
  3.2.3 Number conversion
  For the number-to-text conversion we did:
  We use a pre-built library (indic-num2words) to convert numbers into Indian languages.
  We even built an improved num2words function incorporating the previous one, to adjust for decimal numbers and date-based numbering as well.
  Decimal numbers just included identifying the dots attached to the numbers and then iterating towards the next break, converting them into numbers digit by digit instead of the whole number at once.
  For date-based numbering it involved dealing with languages that do include them (Hindi, Gujarati, Marathi, Bengali, Panjabi, Oriya) and that do not (Tamil, Telugu, Kannada, Malayalam) differently. 
  There was no difference in the number-to-text conversion on the second set.
  For the first set, it required separating the number in hundreds instead of thousands (if the hundreds’ value was greater than 0). Ex: 1920 is not ‘ek hazaar nau so bees’ (one thousand nine hundred and twenty) in hindi, but ‘unees sau bees’ (nineteen hundred and twenty). On the contrary, if the hundred position’s value is 0, say for 2020, then the text is ‘do hazaar bees’ (two thousand and twenty) instead of ‘bees sau bees’(twenty hundred and twenty). 

  3.2.4 Date Matching & Recognition
  In the algorithmic approach, dates are processed slightly differently from other numeric or unit-based patterns. Instead of combining them with ordinary numeric tokens directly, we use a specialized pipeline to recognize, interpret, and transform date strings into a standardized textual format, appending the placeholder:$date at the end. Below is a detailed look at how it works, referencing the relevant Python code.
  3.2.4.1 Core Ideas and Flow
  Regex Identification
  The code uses a regular expression to locate any substring that might be a date. A typical pattern is something like:
  This aims to find up to three numeric parts, possibly separated by ‘-’, ‘/’, ‘.’, or ‘,’, with optional whitespace in between. These parts can represent:
  Two-part dates (e.g., MM/YY, DD/MM, MM/DD, etc.).
  Three-part dates (e.g., DD/MM/YY(YY), YYYY/MM/DD, etc.).


  Parser Functions
  After extracting a candidate date substring, we run it through parsing logic:
  parse_date_parts(parts, original_candidate)
  Splits the numeric chunks into 2 or 3 elements and dispatches to either parse_three_part_date or parse_two_part_date.
  parse_three_part_date(parts)
  Tries permutations like DD/MM/YY, MM/DD/YY, YYYY/MM/DD, or a fallback of YY/MM/DD. It ensures that each day, month, or year is valid (e.g., month in 1..12, day ≤ days in that month).
  parse_two_part_date(parts, original_candidate)
  Allows “half-written” forms (e.g., 05/2005 or 15/04) only if the substring uses / (and not - or . since 1.2 may mean the number and 1-2 may mean something like 1 to 2). This function tries DD/MM, then MM/DD, then MM/YY. The year is guessed with a short-year rule (< 25 => 2000+yy, else 1900+yy).


  Validation & Format Tag
  The parser checks each combination to ensure that the day is within the maximum allowed for that month (accounting for leap years if the year is fully known). If no valid date structure is found, we discard the candidate and leave it as-is.


  Conversion to Textual Format
  If a valid date is recognized, we then:
  Possibly omit the day or year if they’re not present (for half–written forms).
  Use mapping from month number (1–12) to a month name in the language of choice (e.g., in Hindi, 3 -> “मार्च”).
  Attach the placeholder:$date at the end. E.g., if we parse 15/03/1990, and the language is set to “Hindi,” the code might transform it to “15 मार्च 1990:$date”.
  This ensures that subsequent numeric expansions can handle the “year” portion differently (e.g., special expansions like “उन्नीस सौ नब्बे” instead of “एक हजार नौ सौ नब्बे”).
  Regex Replacement
  We use a function: replace_dates_in_text() for conversion of the dates in the text into date tagged entities (with recognised formats and converted month words)

  It uses a replacer callback that calls convert_date_str(...) on each matched substring. If convert_date_str returns None (invalid date), we revert to the original candidate. Otherwise, we embed the textual date with:$date.
  3.2.4.2 Step-by-Step with Key Functions
  replace_dates_in_text:
  Finds potential 2- or 3-part numeric combos.
  For each match:
  candidate = match.group(1)
  converted = convert_date_str(candidate, lang)
  If converted is not None, we replace the original substring with that textual form (including $date).


  convert_date_str(date_str, lang="hindi"):
  Cleans the string (removing extra spaces around separators).
  Splits on [-/.,].
  Calls parse_date_parts(parts, date_str).
  If parse is successful, it yields (day, month, year, format_tag).
  We retrieve the month name from month_names[lang.lower()] or default English names.
  Construct the output string as:
  "<day> <month_text> <year>:$date", omitting day or year if they’re None.


  parse_three_part_date(parts):
  Attempt each pattern in turn:
  DD/MM/YY, MM/DD/YY, YYYY/MM/DD, fallback YY/MM/DD.
  Validate the day, month, year. If found valid, return (day, month, year, '...').


  parse_two_part_date(parts, original_candidate):
  Only triggers if we see / exclusively (no - or .).
  Try DD/MM if the second part ≤ 12.
  Then MM/DD if the first part ≤ 12.
  Finally, if those fail, interpret it as MM/YY.
  Return partial (day, month, year, ...), where some might be None.


  Date Validation:
  Helper functions:
  is_leap_year(year) → check if year is leap.
  max_day_for_month(month, year) → get the day-limit for that month (handles February).
  valid_day_for_month(day, month, year) → ensures day ≤ max allowed.
  convert_year_generic(year_str) → short-year logic (if < 25 => 2000 + y, else 1900 + y).
  3.2.4.3 Example Partial Conversions
  "15/03/1990" (Hindi)
  Splits → ["15","03","1990"].
  parse_three_part_date tries dd/mm/yy(yy) → day=15, month=3, year=1990. Valid.
  Becomes "15 मार्च 1990:$date".


  "05/2023" (Hindi, half–written)
  Splits → ["05","2023"].
  Only uses /, so it tries dd/mm, mm/dd, or mm/yy.
  If 2023 is interpreted as a year via mm/yy logic, it’s “mm=05, year=2023,” so day is None.
  Final: "मई 2023:$date".


  "12.25.2003"
  Because . is present, we treat it as “three-part numeric,” so either DD.MM.YYYY or MM.DD.YYYY or fallback.
  If it’s recognized valid, we convert month=25 => invalid. This returns None, so we revert to the original substring.
  Thus, the date substring is replaced in the text with a placeholder-labeled expansion, e.g., "15 मार्च 1990:$date". Later, the numeric expansions can see the suffix:$date and apply specialized year expansions (like “उन्नीस सौ नब्बे” vs. “एक हजार नौ सौ...”) in the final pass.


  3.2.5 Regex & Pattern Matching
  After we’ve identified the language (or at least the script range) in Section 3.2.2, the next step is to scan the text for dates, recognize the date format, convert the month into a word, and tag the date with ‘:$date’ placeholder. This is then followed by the regex and pattern matching part, where we scan recognized patterns—currencies, units, numbers, etc.—and convert them into words. This is typically accomplished by:
  Tokenizing the text into chunks (words, punctuation, whitespace).
  Combining certain tokens that logically belong together (e.g., comma-separated numbers like 2,000, decimal-separated numbers like 3.14, date patterns like 25/12/2022, currency tokens like ₹500, etc.).
  Converting the units, numbers, dates and currency tokens into words (numbers with the help of num2words part) and rest with an extensive set of dictionary containing currency and unit symbols/shorthands mapped to words in each language.
  Token Splitting (split_string)
  The script’s function:
  Uses re.findall to extract:
  sequences of word characters,
  single non-word, non-whitespace characters (punctuation)
  spaces.
  This ensures the text is broken into tokens: words, punctuation, and whitespace, each captured separately. It’s important because we often want to preserve exact spacing and punctuation when reassembling the final string.
  Combining Date Tokens (combine_date_tokens)
  This function specifically looks for a pattern like [ '1995', ':', '$', 'date' ] and merges them into a single token: "1995:$date".
  Rationale: We interpret :$date as a marker that signals a year or partial date string is truly meant to be a date that should be spelled out differently (like for “day month year” expansions in our improved num2words function).
  It iterates through tokens, skipping whitespace, and tries to detect the three consecutive tokens ":", "$", "date" in order. If found, it merges them with the preceding numeric token.
  This keeps the date references compact, e.g., "1995:$date" or "02/05/1998:$date", so we can do specialized expansions later (like “उन्नीस सौ पंचानवे” or “दो मई उन्नीस सौ अट्ठानवे,” etc.).


  Combining Comma-Separated Numbers (combine_comma_separated_numbers)
  Often in Indian numeric formatting, you see 1,000, 2,50,000, etc. The code merges [ '2', ',', '000' ] into [ '2000' ].
  This is a simple pass that checks if a token is a digit, and if the subsequent tokens are “, plus more digits`,” merges them all into a single numeric token.
  E.g.: ['2', ',', '000'] → ['2000'].
  Combining Dot-Separated Numbers (combine_dot_separated_numbers)
  Similar to commas, it merges [ '3', '.', '14' ] into ['3.14'].
  Specifically, it checks if a token is purely digits and if the next tokens form a “. + digits” pattern. If so, it concatenates them into e.g. "3.14"`.
  This step helps us handle decimals in an earlier pass.
  Currency Combination (combine_currency_tokens)
  Pattern A: <number> <currency token>
  Pattern B: <currency token> <number>
  The code checks:
  If a token is a recognized currency symbol or abbreviation ($, ₹, usd, inr, etc.), it checks whether it’s followed or preceded by a digit token.
  We keep a global dictionary, e.x.:
    1) currency_normalization['₹'] = 'inr' or currency_normalization['rs'] = 'inr' or  currency_normalization['inr'] = 'inr', … etc.
  2) currency_language_mapping['usd']['hi'] = "डॉलर"
  This helps us track all different currency symbols and shorthands, from which we can convert them to words in the respective language already identified earlier.
  The function merges them into a single token with the numeric part plus the spelled-out currency (like "500 डॉलर").
  If a currency token is standalone, it might simply convert $ → “डॉलर” in the chosen language.
  One nuance: trailing dots are removed (e.g., “Rs.” is recognized as “rs,” then mapped to “inr,” etc.).
  Because the user might place the currency either before or after the numeric part, the function checks for both patterns.
  All edge cases involving spaces and dots are checked, for more robustness.
  Unit Combination (combine_unit_tokens)
  Pattern A: <number> <unit token>
  Pattern B: <unit token> <number>
  The script has a big unit_normalization, unit_language_mapping and a unit_variants dict, e.g.:
    1) unit_variants["meter"] = ['m', 'mtr', 'mtrs', 'metre', 'meter', 'metres', 'meters'], this contains shorthands and symbols (in thousands) for a range of different units (more than 100).
      2) unit_normalization["metre"] = "meter", this contains the units listed out in an exhaustive manner, with some common alternate versions. 
      3) unit_language_mapping consists of all of these units in unit_normalisation and maps them to different languages based on their language code.
    e.x.: unit_language_mapping[“meter”][“hi”] = "मीटर"
  If we find “500mg,” we first see the number “500” and the unit “mg.” Once recognized, we produce something like “पाँच सौ मिलीग्राम” in Hindi.
  The function tries to accumulate multi-character units (like “km/h,” “kg,” “tonnes”). It also accounts for possible spacing or punctuation (., etc.) in between.
  Because the user might place the unit either before or after the numeric part, the function checks for both patterns. If only the unit is found, only the unit is converted to word form.
  All edge cases involving spaces and dots are checked for more robustness.
  Checking if Token is Numeric (is_number)
  This is a helper that checks if the token is a number, ensuring that if we have one (or zero) decimal point in the token, it’s considered numeric.
  If it’s purely digits or digits with one dot, we treat it as a candidate number.
  If there are any spaces, it is not considered a number since spaces could mean that the dot is actually a full stop, and by default, spaces would not have been incorporated into the numerical token.
  Main Pipeline to Convert Tokens (convert_numbers_to_words)
  Finally (this is the function to convert the normalized sentence after date replacement (shown later) into words, mostly involving numbers, currencies, placeholders, and units along with simple text):
  def convert_numbers_to_words(text, lang='hi'):
      tokens = split_string(text)
      tokens = combine_date_tokens(tokens)
      tokens = combine_comma_separated_numbers(tokens)
      tokens = combine_dot_separated_numbers(tokens)
      tokens = combine_currency_tokens(tokens, lang=lang)
      tokens = combine_unit_tokens(tokens, lang=lang)
      ...
      # second pass to expand leftover numeric tokens

  Tokenize.
  Combine date placeholders, comma numbers, decimal numbers, currency tokens, and unit tokens.
  Expand any pure numeric tokens using the function improved_num_to_word(...).
  Reassemble them while preserving spacing (by rejoining them carefully—note it uses ''.join(new_tokens) but makes sure whitespace tokens remain intact).
  This chain of transformations effectively merges partial tokens into single tokens representing recognized entities (dates, decimals, currency references, unit references) so they can be spelled out or processed further. This is a post-processing step to the date-to-word conversion step, which involves recognizing the date pattern, understanding the date format, replacing the month with its word, and attaching a date placeholder.
  3.2.6 Placeholder Replacement
  We store the date placeholder:$date, in the date recognition part which is removed from the final converted sentence. The concept is as follows:
  Dates can become, e.x.: “12 May 1998:$date,” then we handle them in a separate pass.
  Alternatively, we merge numeric + currency into a single token, e.x.: "500 डॉलर" which is effectively a “placeholder” for the currency expansion.
  Where placeholders appear:
  Specifically for dates, we use patterns like:$date appended to the final token in combine_date_tokens. Then, in improved_num_to_word(...), if it sees:$date, it triggers date-specific expansions (like “सन् 1947 ईस्वी” or “उन्नीस सौ सैंतालीस”).
  Hence, the placeholder mechanism is integrated into the numeric expansions, ensuring a date-year is spelled out according to each language’s century rules. If no placeholder is found, it’s treated as a normal numeric token.
  3.2.7 Detailed Example
  Now we’ll walk through a complete pipeline example that incorporates date recognition, numeric expansions, currency/unit detection, and reassembly. Consider the input:
  "अनिल का जन्म 15/03/1990 को हुआ। उसने $2,000 बचाए थे, और दो दिन बाद 18/03/1990 को (मित्रों से 100lb उधार लेकर) 2.5km चला।"

  Step 1: Date Recognition
  The function replace_dates_in_text with the regex:
  pattern = r'(\d{1,4}\s*[-/.,]\s*\d{1,4}(?:\s*[-/.,]\s*\d{1,4})?)'
  finds:
  "15/03/1990"
  "18/03/1990"


  For each match, we call convert_date_str.
  "15/03/1990" → parse as DD/MM/YYYY. day=15, month=3, year=1990 → "15 मार्च 1990:$date".
  "18/03/1990" → day=18, month=3, year=1990 → "18 मार्च 1990:$date".
  The text becomes:
  "अनिल का जन्म 15 मार्च 1990:$date को हुआ। उसने $2,000 बचाए थे, और दो दिन बाद 18 मार्च 1990:$date को (मित्रों से 100lb उधार लेकर) 2.5km चला।"
  Step 2: Converting Numbers and Entities
  We pass the above string into convert_numbers_to_words(text, lang='hi'), which:
  Tokenize:
  Splits on words, punctuation, whitespace (like "अनिल", " ", "का", " ", etc.).


  combine_date_tokens:
  Looks for patterns [ '1990', ':', '$', 'date' ] → merges → ["1990:$date"].
  E.g., the substring [ "15", " ", "मार्च", " ", "1990", ":", "$", "date" ] becomes [ "15", " ", "मार्च", " ", "1990:$date" ].
  Now we have tokens like "15", "मार्च", "1990:$date", etc.


  combine_comma_separated_numbers and combine_dot_separated_numbers:
  Merges [ '2', ',', '000' ] into [ '2000' ] if any.
  Merges [ '2', '.', '5' ] into [ '2.5' ].
  In our example, $2,000 → tokens might be ["$", "2", ",", "000"] → eventually ["$", "2000"]and  2.5km → ["2", ".", "5", "km"] → eventually ["2.5", "km"].


  combine_currency_tokens:
  Sees a pattern like ["$", "2000"]. $ -> "usd", but with currency_language_mapping["usd"]["hi"] = "डॉलर".
  Eventually merges into a single token "2000 डॉलर" or sets up for numeric expansion.


  combine_unit_tokens:
  For [ "2.5", "km" ], we check if “km” is recognized. Then we produce something like "2.5 किलोमीटर".
  If “lb” is recognized → 'lb' -> 'पाउंड' in Hindi or 'पाउण्ड' depending on the dictionary.
  So ["100", "lb"] might become "100 पाउंड".


  Final Numeric Expansion:
  If a token is "2.5", we call improved_num_to_word("2.5", "hi") → “दो दशमलव पाँच.”
  If a token is "2000" → “दो हज़ार.”
  If a token is "1990:$date", improved_num_to_word sees :$date and does special date-year expansions (like “उन्नीस सौ नब्बे” in Hindi). The presence of :$date triggers a different rule that might handle century logic (for Indo-Aryan languages we might do “सन् ... ईस्वी”).
  Hence we get a final text something like:
  "अनिल का जन्म पंद्रह मार्च उन्नीस सौ नब्बे को हुआ। उसने दो हज़ार डॉलर बचाए थे, और दो दिन बाद अठारह मार्च उन्नीस सौ नब्बे को (मित्रों से एक सौ पाउंड उधार लेकर) दो दशमलव पाँच किलोमीटर चला।"
  Notice each numeric piece (15, 1990:$date, 2000, 100, 2.5) is spelled out in Hindi.
  The month expansions (“मार्च”) came from date recognition, and the year expansions used the date-based numeric logic with :$date placeholders.
  Currency $ turned into “डॉलर,” and lb turned into “पाउंड.”
  In summary, the final pipeline for a complex input with multiple date references is:
  replace_dates_in_text to parse, interpret, and transform date patterns into a textual <day> <month> <year>:$date form.
  convert_numbers_to_words to handle date placeholders, numeric expansions, currency expansions, and unit expansions.
  Reassemble tokens carefully to preserve the original spacing and punctuation.
  This fully algorithmic approach ensures no hallucination while guaranteeing correct expansions for recognized patterns—dates especially, thanks to a specialized date parser and textual month-labelling.
  3.2.8 Error Handling & Fallbacks
  Unknown Patterns: If we see something like “4.2c/s” but “c/s” isn’t in the unit dictionary, we skip or partially convert only the number.


  Partial Overlaps: If a date also includes a currency symbol (rare but possible, e.g., “12/03-1990$”), the system might incorrectly parse. Typically, we do multiple passes or design one big combined regex to avoid collisions. Though a very thorough and detailed approach is kept to tackle special symbols, ‘-’, ‘$’, spaces, numbers, etc. And a very detailed method is laid out for the procedure to go about for parsing. Still, in certain cases, a rule-based approach may not be enough (either due to the absence of certain rules, not thought for (which can be manually added for a more robust system), or due to the semantic nature of parsing the sentence for which rules may not suffice)


  Language Mismatch: If the text is partially English or code-mixed, expansions might still appear in the guessed language script. We can either do partial expansions or fallback.


  Spacing: We must ensure that after removing the pattern from the text, we place the expanded output with correct spacing. Some code uses sub with capturing groups to handle the spacing elegantly (e.g., a capturing group for leading/trailing spaces).
  3.2.9 Strengths & Drawbacks Revisited
  Strengths:
  No Hallucination: We only output expansions for recognized patterns.
  Absolute Accuracy for known forms: If “$120” is in the dictionary, we do it right 100% of the time.
  Lightweight: No GPU or large model needed. Typically runs in near real-time.


  Drawbacks:
  Coverage: Any new format or rare domain (like “c/s”, “ZAR 500”) must be manually added.
  Maintenance: For 10 languages, each new date or currency style becomes a chunk of new code or dictionary expansions.
  Code Switching or multi-lingual sentences are not easily handled, as each script’s expansions might conflict or overlap.
  Despite these drawbacks, many real-world systems rely on rule-based expansions where stable, guaranteed correctness is paramount. For new or unstructured data, an LLM can fill coverage gaps, but the rule-based method remains a strong fallback or post-processor.
  3.2.10 Potential Extensions & Hybrid Approaches
  Hybrid Pipeline:
  Let the LLM generate expansions, then parse that output with a “checker mode” of this rule-based system. If expansions deviate from recognized patterns, correct them.


  Auto-Generation of Regex:
  Some advanced systems attempt to parse user logs to auto-update the dictionary or patterns for new currencies/units. This reduces maintenance.


  Language ID:
  For code-mixed text (“He spent $120 रूपये”), we can attempt segment-level detection. e.g., if a chunk is in Devanagari, we treat expansions in Hindi, else in English. This can become complex in practice.
  3.2.11 Key Takeaways
  The algorithmic system is extremely precise for in-distribution, recognized patterns.
  Multiple passes or a single mega-regex can capture dates, currency, units, decimal expansions, numeric expansions, etc.
  The approach requires carefully curated dictionaries (month names, currency expansions, unit expansions) for each Indic language.
  For real production usage, pairing a rule-based system with an LLM’s more generalized coverage can yield near-total reliability.



































  3.3 Fine-Tuned LLM Approach
  Goal: Teach an LLM the transformation from numeric forms to spelled-out forms by exposing it to hundreds of examples from the synthetic dataset.
  3.3.1 Base Model
  We used “unsloth/Meta-Llama-3.1-8B” in 4-bit quantization. It’s large enough to handle multi-lingual tasks decently but still possible to fine-tune in ~21GB VRAM (with 24 batch size). The quantization (bitsandbytes 4-bit) ensures minimal memory usage. Memory required for model inference is <6GB on GPU.
  3.3.2 Data & Prompt Format
  We fed each training example in the prompt. So that the model can easily understand what is to be done and can train understanding the information. The sample prompt can be found in the finetuning prompt tab.
  We included a single example in the prompt heading as well (“15/03/1990 को…”). The training loop sees 1,185 of these examples, each in multiple epochs. The eval set is 415 examples.
  Here’s an updated version of the Hyperparameters & Training Cost section, clarifying that the so-called “crashed” run wasn’t forcibly stopped but did, in fact, yield surprisingly good (though unstable) results:
  3.3.3 Hyperparameters & Tuning Runs
  We conducted multiple major training runs, tuning key hyperparameters such as learning rate, weight decay, warmup steps, and LR schedulers. Below is a breakdown of how those runs evolved and why.
  Common Settings Across Runs
  Compute Metrics: We monitored Training Loss, Evaluation Loss (primary), plus CHRF, CHRF++, BLEU, WER, CER on the eval set.
  Optimizer: AdamW (8-bit) for reduced memory usage.
  Max Sequence Length: 2048 (though typical input lengths rarely exceeded ~594 tokens).
  Packing: False (no multi-sample packing).
  dataset_num_process: 2 (minor parallelism).
  Per-Device Train Batch Size: Usually 16 early on, and then increased to 24 in later runs when we realized we had leftover VRAM.
  Gradient Accumulation: With gradient accumulation of 4, the effective batch size is 16 * 4 = 64 or 24 * 4 = 96.
  Epochs: Initially up to 10, but often set to 7 or 8 in later runs to save time and refine.
  Precision: bf16 = True (with fp16 = False) for improved numerical stability.
  Seed: 3407 for reproducibility.
  Logging Steps: 1 (frequent logging).
  Eval Steps: Typically 2, changed to 4 in some runs for time savings.
  Load Best Model at End: True, ensuring the best eval_loss checkpoint is reloaded.
  Save Strategy: By steps, typically save_steps = 4.
  Metric for Best Model: eval_loss (lower is better).
  Note: Another custom metric was used to check the model performance (based on eval and train loss): cutom_metric (squared_eval_to_train_loss_ratio = eval_loss2/train_loss): 0.1312. It minimizes both the eval_loss and the ratio of eval to train loss (signifying overfit). this mostly matches the best performance across metrics (thus, this metric, when good, is often when all the other given metrics are in their best spots). found to be consistent from experimentation across 46 training runs.
  One major drawback of this is that it often goes wrong on sudden peaks in train loss.
  an improvement is to use (eval_loss2)/(min(train_lossj )) for j ranging from 1 to i. This is often a better estimate. here eval_lossx and train_lossx signifies the respective losses at step = x.
  An even better estimate ranges from max(0, i-k) to i. where k is a hyper-parameter decided by the user based on the volatility of the training run and the number of total steps.
  Individual Runs
  Orange is ‘sarvam_training_run_main2’ and Yellow is ‘sarvam_training_run_main1’.
  Below is an overview of the major runs (numbered 1 through 5), plus notes on a “crashed” run that ended unexpectedly but still showed strong partial performance.
  Run 1
  Learning Rate: 2e-4
  Warmup Steps: 20
  Weight Decay: 0.01
  LR Scheduler Type: Linear
  Rationale:
  A moderate LR (2e-4) with ample warmup (20 steps) and linear decay—intended to prevent early instability. This was the baseline for comparison.
  Result:
  Training was stable but somewhat slow to converge.
  Indicated that we could push the LR higher to speed up improvement.
  Run 2
  Learning Rate: 4e-4 (doubling from run1)
  Warmup Steps: 20
  Weight Decay: 0.01
  LR Scheduler Type: Linear
  max_grad_norm: 1.0 (to prevent exploding gradients)
  dataloader_pin_memory: True (faster GPU transfer)
  Rationale:
  Since run1 was stable but slow, we doubled LR to 4e-4. We kept warmup at 20 steps for a controlled slope and introduced grad norm clipping to safeguard training.

  Result:
  Quicker convergence than run1 without major instabilities.
  Reasonable final losses, but we still suspected we could push LR even more.
  Run 3 (Crashed, but Produced Good Results)
  Learning Rate: 4e-4
  Warmup Steps: 5 (much lower)
  Weight Decay: 0.01
  LR Scheduler: Cosine
  Rationale:
  Here, we tested a short warmup to ramp LR up very quickly, plus a cosine schedule for a smoother later-phase decay. The plan was to accelerate early learning.
  Result:
  The training ended unexpectedly early (“crashed”), presumably due to an abrupt LR ramp plus not enough warmup steps.
  Partial logs showed performance/metrics that were too good before it crashed, suggesting the model was learning quickly but on the edge of stability.
  On retrying the run, results were produced that were good but not as good as the crashed version.
  Run 3 (New Variation with slight changes from previous)
  Learning Rate: 1.5e-3 (much higher than 4e-4)
  Warmup Steps: 7 (a bit more than 5, but still short)
  Weight Decay: 0.03
  LR Scheduler: Cosine
  Rationale:
  Post-“crash,” we tried an even higher LR but compensated with more weight decay (0.03) to rein in potential overfitting and partial gradient explosion. We also used 7 warmup steps, since the lr would increase quite fast, and to compensate for it.
  Result:
  Very fast convergence and decent final results, but with spikes in the loss curve.
  Still could not match the results from the “lucky” crash run.
  Run 4
  Learning Rate: 2e-3 (increased further)
  Weight Decay: 0.001 (significantly lower)
  Warmup Steps: 10
  LR Scheduler: Cosine with Restarts
  Epoch: 8
  Rationale:
  From experience, a high LR can accelerate initial learning, but you risk overshoot. Cosine with restarts reboots the LR periodically, allowing re-convergence if the model starts overfitting or flattening out. The lower weight decay (0.001) counters the prior run’s heavy penalty (0.03), which showed to be experimentally worse.
  Result:
  Rapid initial improvement.
  8 epochs gave the system, a shorter train time, and a faster learning rate degradation by cosine curve.
  Some overshoot at earlier epochs, but it typically settled near the end.
  Results similar to that of the crashed run
  More unstable training run with more peaks, and a lower eval loss to due overfitting later on in the run.
  Run 5
  Learning Rate: 1.6e-3
  Weight Decay: 0.005
  Warmup Steps: 8
  LR Scheduler: Cosine with Restarts
  Epoch: 7
  Rationale:
  Refining from run4:
  Slightly lower LR (1.6e-3) for more stability.
  Weight decay at 0.005, balancing the extremes of 0.03 vs. 0.001, preventing overfitting.
  Lower number of warmup steps, for a faster rise (required early) and earlier degradation start off (for later stability) of the learning rate.
  7 epochs to keep the total time short and further the effect of a faster declining learning rate on model overfit and training stability.
  Result:
  One of our best overall runs, with high performance across all metrics, better stability and a lower overfit.
  Had a better performance (arguably) than the crashed 3rd run.
  Found to be reproducible in the reproducibility runs
  Chosen as our main final model for subsequent usage (GGUF conversion, inference tests, predictions, etc).

  Potential Future Tuning
  Shorter Runs: Running only 2–3 epochs if we want a quick improvement over the base, and a way faster lr decline for a way more stable run.
  Richer Data: If the dataset grows to ~10k examples, we might need even more epochs or refined schedules.
  3.3.4 Training Time & Cost
  All these experiments were done on an Nvidia L4 GPU. Notable points:
  Steps/Epoch: ~84, given an effective batch size of 96 (24 per device × 4 gradient accumulation).
  Time/Epoch: ~2 hr 20 min (140 minutes), factoring in both forward/backward passes and partial evaluation.
  Eval Steps: ~42 each epoch, taking ~89 seconds each → ~62 min total for eval.
  Total for 7 Epochs: ~202 minutes (~3 hr 22 min)
  For 20 step (best step before overfit begins, by custom metric): ~20 * 140 / 84 → ~ 33 minutes and 20/2 = 10 eval steps ~ 10 * 89 → ~15 min for eval. Thus a total runtime of ~ 48 minutes.
  Detailed Cost Calculation
  Assuming the platform’s billing model uses “units”:
  Cost of L4: 2.4 units/hour = 0.04 unit/minute.
  1 Unit = $10/100 = 10 cents.
  7-Epoch Training (~202 minutes):
  202 min × 0.04 unit/min × 10 cent/unit = $0.808
  ~ Rs. 70 (if $1 ~ Rs. 86.5).
  Model Checkpoint (~48 min to get best partial step):
  48 min × 0.04 × 0.1 = $0.192 (~Rs. 16.6).
  Total for multiple runs or additional trials (total 5 major runs + 1 reproducibility run + 1 crashed run + 39 other runs = 46 runs): ~$9.6 (~Rs. 830.4) if we sum up extended experiments.
  Resource Standpoint:
  The model loads in ~5–6GB of RAM and handles short inference requests at minimal overhead.
  After training, we merged the LoRA adapter into the base model and quantized to Q4_K_M (GGUF).
  We can deploy on CPU via llama.cpp, incurring no monthly GPU hosting cost on a free Hugging Face Space. The model loads in ~2GB of RAM (without –mlock, which forces offloading to stop, increasing inference speeds) due to storage offloading and memory mapping. This approach marries cheap GPU-based training with free CPU-based inference.
  System Setup time - initially, model download, running .sh files, etc., ~6-7 minutes of initial setup time with < 15 seconds of model load time on refresh (model load time and running the first prompt for caching in RAM). Hugging face stops the face on 48 hours of no usage, then the full setup is again required on loading. Response time during inference is drastically reduced due to prompt caching and multi-threading. (However in the spaces only two cores are present and thus it is recommended to host on higher core systems, modern systems with 8 or 16 cores, would give much faster results)
  3.3.5 Results & Potential Issues
  Accuracy:
  For typical patterns (DD/MM/YYYY, $120, 500mg), near absolute correctness.
  Model Metrics at checkpoint:
  train_loss: 0.101
  eval_loss: 0.11551
  cer: 0.12292
  wer: 0.09581
  bleu: 0.87392
  chrf: 94.0154
  chrf++: 93.78756
  cutom_metric (squared_eval_to_train_loss_ratio = eval_loss2/train_loss): 0.1312
  Fumbles on:
  Rare or strange date formats or complex scientific units.
  Very large sentences (~200+ tokens)
  Confuses between similar numbers and large decimal expansions
  Over-generation (the model might keep generating instructions after finishing)
  In short, the fine-tuned approach is flexible and covers more variety than the rule-based approach (if it was in the training data). However, if it sees something truly outside its training distribution, it might guess or hallucinate.














  Reproducibility run plots:






  4. Training & Evaluation Metrics
  4.1 Metrics Chosen
  We meticulously tracked:
  Training Loss & Eval Loss: Classic measure of how well the model fits the data.
  BLEU: Word n-gram precision. Good for measuring overall textual overlap.
  CHRF/CHRF++: Character-based F-scores, essential in highly inflected or morphological languages.
  CER: Character Error Rate, i.e., edit distance at the character level. If CER is 0.12, it means 12% of the characters differ from the reference.
  WER: Word Error Rate. Possibly ~0.08–0.1, meaning 8–10% of the words differ from reference.
  Why so many? Because partial expansions might get a high BLEU but fail at the character level. CHRF, CER, WER each provide different vantage points on alignment with the reference.
  4.2 Example Metric Snapshots
  A “best checkpoint” from run5 might show:
  train_loss = 0.101
  eval_loss = 0.115
  bleu = 0.874
  chrf = 94.015
  chrf++ = 93.788
  cer = 0.123
  wer = 0.096
  Interpretation: ~12.3% of characters differ from reference, which is quite decent for large multi-lingual expansions. The model is basically correct ~88–90% of the time for tricky expansions.
  4.3 Additional Logs
  We also tracked:
  grad_norm: to see if the gradients were exploding or not.
  learning_rate: verifying the scheduler’s shape.

  5. Model Deployment & GGUF Conversion
  5.1 Why Convert to GGUF?
  Once the model is fine-tuned, we want to deploy it for inference in a cost-free environment (like CPU-only Hugging Face Spaces). The “gguf” format, used by llama.cpp, is a CPU-friendly quantized format that:
  Reduces memory usage further than typical GPU quantization.
  Permits near real-time inference for short sequences on 2 CPU cores.
  Minimizes hosting cost. (We avoid renting GPUs indefinitely.)
  Hence we used unsloth’s built-in function push_to_hub_gguf(...) with quantization_method="q4_k_m". This yields a ~5–6GB model file.
  5.2 HF Spaces Setup in Depth
  5.2.1 Repository Structure
  We have a space like Tasmay-Tib/sarvam-ai-entity-normalisation on Hugging Face. Its files include:
  app.py (Streamlit app):


  A text input for the user.
  A function infer() that crafts the prompt (the same “instruction + input” used in training).
  Submits this prompt to the local server via a curl POST request to http://localhost:8081/completion.
  Displays the returned text to the user.
  init.sh:


  Clones llama.cpp.
  Builds llama-server.
  Downloads the .gguf model from Hugging Face.
  Launches llama-server on port 8081.
  init2.sh:


  If the environment restarts, re-check if llama.cpp is present, re-launch the server.
  requirements.txt: Minimally lists dependencies (requests package is the only one).


  5.2.2 Free Tier Challenges
  On free Spaces, the container stops after inactivity. We used an environment variable ran_script_once to skip re-downloading each time. But if the container is truly restarted, it has to recompile. Users might wait a few minutes for the server to initialize. After that, requests are fairly quick for short sentences.
  The whole setup takes about 6-7 minutes to complete. On loading the model once, it stays for the next 48 hours, else Hugging Face will turn it off. Since no permanent storage is subscribed to it will have to reload if started after that. Otherwise, the model takes less than 15 seconds to load.
  In the free tier a big challenge is inference speeds, for which we have switched off the ram offloading option (enabling the whole model to be loaded on RAM), enabled multi-threading and implemented prompt caching for the instruction, example part of the prompt to be preloaded, for quick loading of the model.
  5.2.3 End-User Experience
  Wait-time: The server gets setup (if not already), or the model is spun up for serving and the prompt cached (if it is already setup).
  User: “I have a sentence: 20/02/2023 को मैंने ₹500 में 3lbs चीनी खरीदी।”
  They press “Submit” or click enter on the input box.
  The App: constructs the big prompt and does curl --data '...json...' http://localhost:8081/completion.
  llama-server: loads the model in CPU (already loaded when shown to the user), runs the inference, returns the spelled-out text.
  The App: displays something like “बीस फरवरी दो हजार तेईस को मैंने पाँच सौ रुपये में तीन पाउंड चीनी खरीदी।”
  User: “Copy Output” button to get it in the clipboard.



  6. Comparative Analysis & Discussion
  6.1 Which Method Excels Where?
  Agentic:
  Pros: Zero additional code or training. Just prompt engineering.
  Cons: Inconsistent, no guaranteed coverage, messy for multi-lingual expansions.
  Algorithmic:
  Pros: Deterministic, absolutely correct for known patterns, easy to interpret.
  Cons: Requires large, ever-expanding sets of rules. Misses out-of-vocabulary patterns.
  Fine-Tuned LLM:
  Pros: Adapts to minor format changes or partial transliteration, more “intelligent” coverage if it’s in distribution.
  Cons: If it encounters something unseen, it might guess or hallucinate. Some minor numeric errors, especially for larger numbers or decimals.
  In Real Life, A system might use a hybrid approach: Let the LLM produce expansions, then pass it through the rule-based checker, and for one, just pass the input to a rule-based checker or corrector. Then, have some sort of text classification system, maybe BERT-based or LLM-based, to classify which is better.
  6.2 Synthesis of Costs, Complexity, & Reliability
  The rule-based approach is cheap to run but expensive to expand to new domains.
  The fine-tuning approach costs a moderate amount (~$10 for training), but once done, it is easy to deploy in CPU format.
  The agentic approach is easy to test but rarely meets production-level reliability.
  A fourth method is using an instruct-trained model for deployment, but that won’t require any inputs from this task’s perspective and would require a simple prompt. Thus, that is not demonstrated here.













  7. Scope for Improvement
  We see multiple opportunities for future expansions:
  Larger & More Varied Dataset
  Expand from 1,600 to 10k or 20k samples, ensuring more unusual currencies (SAR, RUB, BRL?), more advanced scientific units (kWh, mmHg, bar), and more irregular date formats. Possibly include code-switched data or incomplete transliteration.
  Improve the dataset on aspects mentioned before:
  Long sentences
  Complex Units
  Decimal based longer numbers
  Larger variety of numbers
  Including rarer date formats
  Integrate real or partially real data (like Indian news articles or official docs) to reduce the synthetic bias, or use data from multiple models.
  Multi-Task or Instruction Tuning
  Tying entity normalization with other tasks like translation, summarization, or classification can yield a more robust model.
  Or incorporate a “chain-of-thought” approach with step-by-step expansions for each numeric entity.
  A general instruct tuning approach may also be tried, which can then be further finetuned for entity normalisation task. (However this would cost much more)
  Fine-Grained Domain Coverage
  Cover a larger variety of data e.g., medical domain might have specific units like “mg/dL,” “mmHg,” “IU,” “mL/h,” which are rarely used in finance or general domains. Each domain can get its own expansions.
  Hybrid Pipeline
  Let the model do its best, then verify or correct it with a rule-based approach. If the model’s expansions deviate from recognized patterns, the system can highlight or fix them, thus combining the best of both worlds.
  Better Language Detection
  Some user inputs might have code-mixed text, e.g., “He purchased $120 में 2kg दूध.” The system must handle partial English and partial Hindi. A robust code-mixing detection method might be required. (This is expected to move generalisation by a lot)


  8. Conclusion & Refrences
  8.1 Conclusion
  Through this project, we deeply explored the domain of multi-lingual entity normalization for Indic languages—focusing on dates, currencies, and scientific units. We produced a synthetic dataset of 1,600 examples, balanced across 10 languages and multiple domains, each carefully split into train and eval sets. Then we implemented:
  Agentic Method: Straight prompts, suboptimal.
  Algorithmic Method: Rule-based, extremely accurate for known patterns but limited coverage.
  Fine-Tuned LLM: A 4-bit LLaMA 3.1 model trained on our dataset. This approach delivered high accuracy on the evaluation set. It still had some corner-case issues (decimal expansions, rare date forms, etc.).
  Finally, we converted the SFT model to a GGUF format for CPU hosting, deployed on a Hugging Face Space with a Streamlit front end and llama.cpp–based server. The entire pipeline is relatively cost-effective, with total training cost under $10 on an L4 GPU and about ($0.81 for one run), and free CPU inference on HF Spaces.
  Next Steps: We plan to add more data, handle code-mixed text, increase the vocabulary for the algorithmic approach, and consider a “hybrid approach” that merges LLM outputs with rule-based validation for near-perfect coverage in real production.
  8.2 References & Links
  Primary Links:
  Synthetic Dataset: Tasmay-Tib/sarvam-entity-recognition-gemini-2.0-flash-thinking-01-21-distill-1600
  Fine-Tuned Model: Tasmay-Tib/sarvam-entity-normalisation-llama-3.1-8b
  GGUF Model: Tasmay-Tib/sarvam-entity-normalisation-llama-3.1-8b-gguf
  HF Spaces Deployment: Tasmay-Tib/sarvam-ai-entity-normalisation
  Secondary Links:
  Main Model training notebook with all three approaches: https://colab.research.google.com/drive/16_c--?usp=sharing
  Wandb Plots for train runs (6 major ones out of 45 total): https://api.wandb.ai/links/tasmaytibrewal-iit-kharagpur/
  Model Inferencing Colab notebook: https://colab.research.google.com/drive/-?usp=sharing
  GGUF Model Inferencing Colab notebook: https://colab.research.google.com/drive/?usp=sharing
  Synthetic Dataset Creation chat (Google AI Studio): https://aistudio.google.com/app/prompts?state=%7B%%22:%5B%%22%5D,%22action%22:%22open%22,%22userId%22:%22107745987607842002805%22,%22resourceKeys%22:%7B%7D%7D&usp=sharing
  HF Spaces Deployment GitHub Repo: https://github.com/Tasmay-Tibrewal/GGUF-HF-deployment
  GGUF conversion notebook: https://colab.research.google.com/drive/?usp=sharing
  Final Model Reproducibility Notebook: https://colab.research.google.com/drive/?usp=sharing
  Wandb Plots for Reproducibility run: https://api.wandb.ai/links/tasmaytibrewal-iit-kharagpur/
  Prediction creation notebook: https://colab.research.google.com/drive/?usp=sharing
  GGUF Model zip file: https://drive.google.com/file/d//view
  Stop-words zip file: https://drive.google.com/file/d//view?usp=sharing
  Model Predictions (eval, train, and total data preds, in normal and excel format, which are utf-8 and utf-8-sig encoded. Excel reads utf-8-sig easily, thus good for viewing, utf-8 is the standard encoding method used for programming purposes, thus it is given as well):
  1. Eval:											      - eval_data_001_predictions.csv (utf-8 encoded): https://drive.google.com/file/d/--le/view?usp=sharing          - eval_data_001_predictions_excel.csv (utf-8-sig encoded): https://drive.google.com/file/d//view?usp=sharing
  2. Train:											      - train_data_001_predictions.csv (utf-8 encoded): https://drive.google.com/file/d/1---/view?usp=sharing         - train_data_001_predictions_excel.csv (utf-8-sig encoded): https://drive.google.com/file/d/-/view?usp=sharing
  3. Total:											      - data_001_predictions.csv (utf-8 encoded): https://drive.google.com/file/d//view?usp=sharing          - data_001_predictions_excel.csv (utf-8-sig encoded): https://drive.google.com/file/d/-/view?usp=sharing
  '''

In [2]:
text = '''DEEP FAKES
    Deep fakes, coming from “Deep Learning” + “Fakes”, are images, videos, or audio manipulated using deep learning algorithms which map a face onto a given target, resulting in a convincing output in the same environment of the target but with the face given; or to extract facial expressions from a given input to generate a manipulated result.

    It has various uses, such as in the entertainment industry, imposing actors’ faces onto other performers such as stuntmen or lip-syncing for multiple languages using dialogue replacement, for marketing, as used in ITC for Sunfeast or the one used by Cadbury in “NotACadburyAd”, both featuring Shah Rukh Khan, it can be used in ed-tech as well where historical figures can be mapped onto teachers making it a fun learning experience; furthermore, it can be used in the gaming industry for realistic content along with other industries.

    The Sunfeast ad included a mapping of your face onto someone else’s face already in the ad. Whereas the Cadbury ad included Shah Rukh Khan branding your business, which basically included altering the portions of audio, modelled on his voice, and lip-syncing his video for the audio. Zomato’s “Mann Kiya, Zomato Kiya” ad which included Hrithik Roshan asking for food from a popular restaurant based on the city you were in, using dialogue replacement. Even we used deepfakes for publicity during freshers selections in 2023. 

    It is also being used in social media to generate funny memes and “obviously fake” impressions of celebrities or iconic moments. There are even fake news channels which have just AI generated content on YouTube.

    Deep fake isn’t just limited to faces or people, researchers from MIT deepfaked a whole city.

    Deepfaked Boston’s Back Bay neighborhood by an AI model trained on images of Aleppo, by MIT Researchers


    It can be used in propaganda, spreading misinformation and fake news. This has already been used to play out global geopolitics, especially by countries such as China, in pro-China news channels. In fact Volodymyr Zelensky, in a video circulated, apparently surrendered. In an era where information travels at lightning speed through social media and online platforms, deep fakes can be used to alter public perception, manipulate elections, or even incite violence. 

    The ability to fabricate convincing videos of public figures saying or doing things they never did raises serious concerns about the erosion of trust and the credibility of digital content. MIT even released a deepfake detection experiment to see how many people get deceived.

    Deep fakes, moreover pose a threat to privacy and personal security. By superimposing someone's face onto explicit or compromising content, individuals can be targeted and blackmailed. Another worrisome aspect of deep fakes is their potential impact on the justice system. With the ability to create fabricated evidence, including audio or video recordings, the credibility of digital information presented in courtrooms could be called into question. It is also being extensively used to pull off scams, like the infamous Elon musk crypto scam and even heists.

    It is most widely used, by far in the pornographic industry, estimates suggest upto 96% of its usage in the space. There are many reports of actors being targeted in such videos and sometimes even normal people are victims of this, especially women.

    “By blurring the line between fact and fiction, deep fake technology could undermine public trust in recorded images and videos as objective depictions of reality.” - said a letter to the director of National Intelligence by the US Congress. It not only means that images, audios and videos have a possibility of being fake, but also shatters our trust in them, leading to questioning of evidence that is actually real. Lawmakers all across the world are scratching their heads about how to deal with this technology. US government even involved the FBI and the Defense Advanced Research Projects Agency (DARPA) which is a military agency to tackle this issue, which used an AI trained on deep fakes to tackle this issue.

    A technology created by AI researchers focuses on combining a recurrent neural network (RNN) and a convolutional neural network (CNN) to enable the programme to detect whether or not a video has been altered. An image-analyzing deep learning algorithm is called a CNN. An RNN is a kind of neural network that is frequently used in applications like speech recognition, natural language processing, and translation. It employs sequential or time series data. Guera and Delp (2018) showed that a CNN and an RNN feedback loop is a useful combination for reliably identifying deep fakes. This technique basically depends on teaching AI to detect subtle linguistic and visual inaccuracies and inconsistencies in deep fake content, just like a human would. Another promising approach to the technological detection of deepfakes is this method.

    But with time, these are improving, too exponentially, high resolution hyper-realistic depictions of reality, convincing even to the best trained human eyes, how good can good be, better than reality? Can it warp the reality we are surrounded with and we trust. How good can the AI models to detect deep fakes get, can they spot what human eyes can’t and what stops newer deep fake models to be trained just to defeat these detection models. Current best detection models show a 94.4% accuracy, that too on 2020 data, considering computational improvements, newer techniques, efficient, better and faster algorithms will perform the same on best quality deep fakes from 2023. Even a 94.4% accuracy means a carefully constructed deepfake, trained on extremely large datasets using state of the art models for a very large time would beat it.

    But how do deep fakes work (from absolute basic all the way up)?
    Deep fake uses a combination of Generative Adversarial Networks (GANs) and Auto-encoders.

    Generative Adversarial Networks (GANs) :
    GANs are pretty much just two models (generally CNNs) in combination. One acts as a discriminator while the other is a generator.

    Now what are these complicated words, GANs, CNNs, Encoders?


    Intro to ML
    Consider a model. A model is some function which outputs a value on some input. A more complex version of this is y=wx+b, where w is called weight and represents slope of a line and b is the bias or the y-intercept.

    This simple function is used in linear regression models. Regression is when we have to predict some output using an input. 

    It may not be possible to do so every time, but you want to be close to predicting the output, as much as possible. Say house prices vs area of house. 

    Now, this may be correlated linearly or exponentially (i.e. y=w1x + w2x2 + …. +wnxn +b)
    [image on right], 

    Now this is just with one feature or type of input, there may be multiple dependencies or features. Thus 
    y = w11x1 + w21x12 + …. +wn1x1n 
    w12x1 + w22x12 + …. +wn2x1n + b
    [Ex: House price dependent on both area and number of bedrooms]

    Now this is the most simple machine learning model, linear regression.

    We start with random values of weights and biases and optimise the algorithm to maximise accuracy and minimise the error.

    We can calculate error over all data points and sum it up as the cost (measure of how bad the model performs). Generally for regression we use Mean squared error.

    h(x) = y [predicted] = wx + b;
    Error = (h(x) - y)2
    Cost = J(x) = 1/n * Σ(h(xi) - yi)2; where xi and yi represent the ith input and output respectively.
    To maximise the performance of the model, we minimise this cost or error for which we use optimisation algorithms such as gradient descent.

    To optimise we calculate the value of weights and biases for which cost has minimum value.

    Gradient descent calculates the partial derivative of the cost function, with respect to the weights and biases [gradient vector of the weights], and updates the weights by subtracting them with a negative multiple of it.

    w := w - α * ∂J(w)/∂w; the gradient or derivative represents the slope and -α * ∂J(w)/∂w is some negative multiple of it, if updated correctly it converges to a minima.



    This is how ML simply works, set some parameters with random values and optimise it until cost becomes low.


    Neural Networks
    This is a bit complicated and would urge you to watch this series for a better understanding.

    Neural networks are something which scientists modelled to replicate how the human brain works. It contains layers and nodes within them called neurons linked to each other.

    A simple neural network is this

    This contains many input neurons and 1 output neuron, basically a single-layer network.

    The output neuron accepts inputs and gives an output.


    h(x) = w1x1 + w2x2 + w3x3 + …. + b; thus the simplest neural network is just linear regression.
    A more complicated neural network has hidden layers.

    Generally we also add a non-linearity function to account for non-linear dependencies.

    h(x) = σ(w1x1 + w2x2 + w3x3 + …. + b) or RELU(w1x1 + w2x2 + w3x3 + …. + b)

    σ(x) = Sigmoid function = 1/(1+e-x); RELU(x) = max(0,x).

    This one has two hidden layers. (considering 2 neurons in each layer and 1 output)


    h1 = σ(w1x1 + w2x2 + b1)
    h2 = σ(w3x1 + w4x2 + b2 )

    h3 = σ(w5h1 + w6h2 + b3)
    h4 = σ(w7h1 + w8h2 + b4)

    y = σ(w8h3 + w9h4 + b5)

    All this compounding of outputs, eventually results in a complex space represented by the model. Optimised by gradient descent using backpropagation this can theoretically represent all patterns and even memorise it, for a large enough training dataset and model architecture.

    Linear regression has a paraboloid or equivalent cost function whereas neural networks have complex cost functions, which may even have a local minima. (We use various techniques such as multiple iterations of random initialisation of weights and moments to tackle this)

    Thus our goal for large networks is not to completely optimise, but to reach to a good enough solution.

    Back-propagation in neural networks (intuition and calculation)
    Here we basically want to account for the weights to be updated by moving back layer by layer.
    Intuition
    Here we are classifying an image of a hand-written two, using a neural network. (Image has values of pixels to predict the output)

    Ideally it should output 1 but it outputs 0.2. So we have to increase its activation (value of neuron). We also have to decrease value of other numbers. Thus we have a quantitative idea of increase/decrease required in neurons of output layer.

    To do this we change value of not only weights and bias of this layer, but also the neurons of previous one.

    That leads to change in the previous layer weights and biases, along with that in the last to previous layer, and so on.

    The increase in weights are done such that we maximise our increase/decrease required. To do this we change them proportionally to the activation of the neuron they are connected to. This maximises the effect of increase.

    For a 0.9 activation if we increase its weight from 0.1 to 0.3 we get a change of 0.09 to 0.27, net change of 0.18, whereas for a 0.1 activation the same results in a net change of 0.02. Thus we increase to get the most effect, thus proportional to the activation.

    Similarly neuron activations would be increased proportional to value of weights.

    This cycle continues until input layer.

    Calculations
    wi := wi - α * ∂J(W,B)/∂wi; for all wi in W [vector of weights].
    bi := bi - α * ∂J(W,B)/∂bi; for all bi in B [vector of weights].

    We give a naming convention to neurons:

    The last layer activation is aL; then aL-1 and so on.

    aL = σ(zL) = σ(wLaL-1 + bL)
    Here z is the fn which goes through non-linear function to give activation, 

    zL = wLaL-1 + bL
    L represents the last layer, wL and bL are the weights and bias for the last layer.

    ∂J(W,B)/∂wiL = ∂J/∂aL * ∂aL/∂zL * ∂zL/∂wiL; [wiL being the ith weight for last layer]

    ∂J(W,B)/∂wiL = ∂J/∂aL * ∂aL/∂zL * ∂zL/∂wiL = ∂J/∂anL * ∂anL/∂znL * ∂znL/∂wiL ; n is the neuron to which the weight is connected, i is the ith weight.

    J(W,B) = 1/k*Σ(h(xi) - yi)2 = Σi = 1 to k Σj = 1 to m  (ajL - yj)i2, k being the number of training examples and m being the number of neurons on the last level.

    aL = σ(zL);  zL = wLaL-1 + bL; 

    ∂J/∂anL = 2*(anL - yn)i 
    ∂anL/∂znL = σ’(zL)
    ∂znL/∂wiL = apL-1 ; p is the neuron from which weight was required

    Thus, ∂J(W,B)/∂wiL = 2*(anL - yn)i * σ’(zL) * apL-1 

    Similarly for neurons in previous layers:
    ∂J(W,B)/∂wiL-1 = Σj=1 to m ∂J/∂ajL * ∂ajL/∂zjL * ∂zjL/∂anL-1 * ∂anL-1/∂znL-1 * ∂znL-1/∂wiL-1 ; m being number of neurons on last level

    ∂znL-1/∂wiL-1 = apL-2 ; p is the neuron from which weight was required
    ∂anL-1/∂znL-1 = σ’(zL-1)
    ∂zjL/∂anL-1 = wjnL-1; wjn is weight from jth neuron to nth neuron.
    ∂ajL/∂zjL = σ’(zjL)
    ∂J/∂ajL = 2*(anL - yn)i

    ∂J(W,B)/∂wiL-1 = Σj=1 to m 2*(anL - yn)i * σ’(zjL) * wjnL-1 * σ’(zL-1) * apL-2 ;

    Similarly,
    ∂J(W,B)/∂biL-1 = Σj=1 to m 2*(anL - yn)i * σ’(zjL) * wjnL-1 * σ’(zL-1) * 1;

    We can do this for all weights and biases. These partial derivatives for weights gives us the gradient of the vector W. 
    W := W - α * ∇ J(W,B)
    B := B - α * ∇ J(W,B)

    We update W and B vectors with a negative multiple of their gradients [gradient is the steepest slope upward for a function]. This done through multiple iterations leads to convergence.


    




    Convolutional Neural Networks
    What is a convolution?
    Convolution is a mathematical operation similar to what multiplication or division is.
    It is represented by “∗” sign. Watch this video for a better understanding.

    (a∗b)n = (a∗b) (n) = Σi,j i+j= n ai ⋅ bj = Σi = 0 to n ai ⋅ bn-i = a0 bn + a1 bn-1 +a2 bn-2  + . . .  + an-1 b1+ an b0
    Here a and b are a list, vector or a matrix containing elements a0, a1, . . ., am and a0, a1, . . ., ap respectively where m,p >= n.


    Thus for a complete list:
    (a∗b) = [(a∗b)0, (a∗b)1, (a∗b)2, . . . , (a∗b)p+m]

    a = [1, 2, 3, 4];  b = [5, 6, 7, 8, 9]

    (a∗b) = [ (1*5), (1*6 +2*5) , (1*7 + 2*6 + 3*5) , (1*8 + 2*7 + 3*6 + 4*5) , (1*9 + 2*8 + 3*7 +4*6) , (2*9 + 3*8 + 4*7) , (3*9 + 4*8), (4*9)] 
    Thus,
    (a∗b) = [5,  16, 34, 60, 70, 70, 59, 36]

    Visual Process:
    A                    :                 [1, 2, 3, 4]              [1, 2, 3, 4]            [1, 2, 3, 4]        [1, 2, 3, 4]
    B (Reversed):  [9, 8, 7, 6, 5]              [9, 8, 7, 6, 5]            [9, 8, 7, 6, 5]        [9, 8, 7, 6, 5]
    (a∗b) =                        [     1*5 ,                1*6 +2*5,      1*7 + 2*6 + 3*5, 1*8+ 2*7+ 3*6+ 4*5,

    A                    : [1, 2, 3, 4]       [1, 2, 3, 4]           [1, 2, 3, 4]             [1, 2, 3, 4]
    B (Reversed):  [9, 8, 7, 6, 5]       [9, 8, 7, 6, 5]          [9, 8, 7, 6, 5]             [9, 8, 7, 6, 5]
    (a∗b) = , 1*9 + 2*8 + 3*7 +4*6 , 2*9 + 3*8 + 4*7,      3*9 + 4*8,                 4*9]

    For functions:
                    ∞
    (f∗g) (t) = ∫   f(x) f(t-x) dx
                            -∞  
                        ∞
    (f∗g) (2) = ∫    f(x) f(2-x) dx = some const. value; Thus for (f∗g) (t), each point is an
                            -∞
    Integral for all x from -∞ to +∞ for a particular value of t.
    2D convolutions
    A 2D Matrix, also called kernel is used to convolute over a 2d matrix in the same way it does over a 1D matrix.







    ∗                    =




    Matrix A                                   Matrix B                       Matrix (A ∗ B) = C
            

    C11 = A11B11 + A12B12 + … + A21B21 + A22B22 + …. + AnnBnn

    C12 = A11B12 + A12 B13 + …. + AnnBn(n+1)

    And so on….





    Image Convolutions

    This kernel is used for 
    a normal blur filter 
    on images.
    Basically it averages out 
    The nearby pixel values.


    This kernel is used for 
    a gaussian blur filter 
    on images.
    This takes weighted average of neighbouring pixel values, according to gaussian function.

    These kernels are used for Horizontal and vertical edge detection 
    on images.



    Convolutions are popularly used in Image processing and kernels are of various types.

    Problems with our original neural network model on detecting numbers from hand-written images:
    Computationally expensive: For High-res images, with millions of pixels, there would be millions of weights to be optimised, it would be a very high dimensional data and would take up huge computation power.
    Poor training: High dimensional datasets require very large datasets to train, and accuracy is poor compared to other more efficient models, such as CNN and requires a considerably higher number of data points to reach the same accuracy.
    Low parameter efficiency: Requires a large number of parameters as compared to CNNs, which share them for pixels while convoluting. This leads to computational and storage inefficiency.
    Hierarchical Feature Learning and Spatial Pattern Recognition in CNNs: CNNs are capable of learning hierarchical features from raw pixel values. Lower layers learn basic features like edges and textures, while higher layers learn more complex features and object representations. Standard feedforward neural networks do not consider the spatial relationships between pixels in an image. Images have a grid-like structure where the arrangement of pixels carries important information. Convolutional Neural Networks (CNNs) are specifically designed to handle such grid-like data and capture spatial patterns through convolutional layers.

    Basically CNNs are able to recognise patterns, they pick up certain patterns and get the most activated for them.  They understand the underlying facial structure for faces, or predict images just how we thought a standard neural network would do while modelling it. While looking at the results of the neural network,  it was more arbitrary, this could be because it is a very high dimensional data, so it would require way more training examples to find patterns, but if such huge datasets exists to satisfy the tens or hundreds of millions of weights and biases on which it needs to be trained, then they potentially could produce pretty solid results. 

    On the contrary, CNN captures patterns like a human brain would, by recognizing edges in earlier layers, patterns then, and features and structure of the image in proceeding layers. They can convert the image into a latent space, which is like a low-dimensional representation of the high-dimension data that the model has captured. We can ideally associate a face image’s latent space with basic features such as 
    emotions, age, gender, etc.

    These types of results are possible because of the architecture of CNN models, which consists of kernels with random initial weights and biases convoluting over images finding patterns such as edges, early on and moving to complex patterns as we go further.

    CNN Architecture:

    CNN consists of different types of layers:
    1) Convolutional Layers: These consist of a set of kernel weights, randomly initialised and optimized later through gradient descent.

    w1 , w2, . . ., w9 are a set of weights of each kernel that we use. We also have a bias term attached to output of each kernel. Thus for a 3x3 kernel we have 9 weights and 1 bias.
    We use one kernel for each image in the previous layer for each image in the current layer.


    Basically it means that while going from layer 1 to layer 2 for each image channel in layer 2 (say n2) we have weights for all image channels in layer 1 (say n1), thus in total we have n1*n2 kernels for layer 1 to 2.

    Thus we have a total of n1*n2*10 weights and biases for the layer. As compared to n1*n2*28*28 weights if we had used pixels individually.

    We further, similar to a simple feed-forward neural network also use an activation function to introduce non-linearity.
    Otherwise it would just be kernels over kernels, which just produce a linear result.

    Z22 = w1 X11 + w2 X12 + w3 X13 + w4 X21 + w5 X22 + w6 X23 + w7 X31 + w8 X32 + w9 X33+ b
    (we are convoluting over X22, we keep the centre of our kernel on X22)

    A = RELU(Z) = σ(Z) 

    A is the activation of the pixel part of a particular image in the next layer, for which we convoluted through the kernel.

    But there is a problem, when we try to compute Z11, some parts of the kernel are out of the region with the pixels. 

    This happens with all the corner, edges of the image and we commonly use zero-padding to counter this.

    What is zero padding?
    We add some extra rows and columns wherever needed with values as zero such that the output is of the same size.

    Suppose we have a 3x3 matrix for a 28x28 inout image then we wilkl get a 26x26 output image thus we use a 1 layer zero padding, effectively making the image a 29x29 image, 

    This is the example of a 5x5 matrix with a single layer zero padding.

    The output remains a 5x5 matrix.

    There are other types of padding as well but zero-padding remains the most widely used one.





    2) Pooling layers
    Pooling layer is when we combine some squares into one using some properties.
    There are many types of pooling, the most common is max-pooling which takes max of all cells as the output value, there is also min pooling, average pooling, etc.

    Pooling is done to reduce the overall amount of information and to save on memory/parameters. Although information is lost, the crux of the image is retained.





    3) Flattening
    Finally when the size of the enough is small enough, the image is flattened into a vector (with a single row) and final neural network layer(s) is applied to it.

    4) ANN layers:
    Normal feed-forward neural network layers act as additional layers to the network.


    5) Hyperparameters:
    There are several hyperparameters in a CNN. They include 

    Number of Layers
    Width of the layers
    Kernel size [nxn]: Depends on the size of image, mostly n is an odd number, can range from 3x3, 5x5, 7x7, 9x9, 11x11, 13x13]. Larger the size more the information is lost.
    Padding: How much extra space is added around the image for convolution operations, and with what values, usually zero-padding is used to preserve the size (called Same padding), otherwise Valid padding (no padding), other types include replication padding, reflection padding, etc.
    Pooling Type and Kernel size: Type of pooling used max pooling, average pooling and min pooling and kernel used for pooling, larger the kernel more is the information lost.
    Stride: The step-size taken by convolution/pooling operations is called stride. By default stride for convolutions is 1 and for pooling is the size of the poolling kernel.

    6) Backpropagation: Similar to that in ANN (Artificial Neural Networks), we backpropagate following chain rule across layers, however navigating for convolutional architecture may not be straightforward and a challenge on its own, to keep it concise it is not covered in the scope of this blog.  

    I have attached the explanation of backpropagation here.

    7) Fast Convolution Implementation:

    The convolution process can be extremely slow for large networks, across thousands of input images, across millions of parameters, for thousands of iterations and hundreds of steps.

    So we implement it in a matrix form. Where we have the convolutional vector and a flattened input vector, and we perform the operation as their matrix multiplication.

    Say we have a 4*4 input image and a 3*3 kernel, then we can construct the vectors as follows:

    Kernel (K11, K12, …, K21, .., K33)     Image(I11, I12, …, I21, .., I44)        Output (O11, O12, O21, O22)


                                        
                                                                                                
    Ideally we could have done O11 = K11I11 + K12I12 + .. + K13I13 and iteratively repeated for O11 - O1n then O21 - O2n and till Onn.

    But it is computationally expensive, so we express each element in O as function of two matrices since matrix operations are highly optimized:



    O11 =  Dot product of K11 with I, O12 = Dot (K12, I), O13 = Dot (K13, I);  O14 = Dot (K14, I)


    Matrix K11                        Matrix K12

    This is faster, but dot product can be represented further as dot of flattened K11 and I vectors. 

    Here, 
    K11 = [1, -5, 4, 0, 0, 3, 1, 0, -3, -2, 0, 0, 0, 0, 0, 0]T = A 
    If = [10, 25, 20, 2, 25, 15, 18, 5, 5, 17, 20, 10, 1, 12, 25, 7]T = B

    But dot of 2 vectors (A.B) can further be represented as 

    (AT x B) = (K11 )T * If = O11

    Similarly, O12 = (K12)T * I and so on…

    Thus a new vector can be constructed: 
    C = [(K11)T, (K12)T, …, (Knn)T]T
    ; where K11 - Knn are not the elements of Kernel K, but Kernel K zero-padded to fill the shape of Image I at its different convolutional positions and then flattened out as a row. Here C itself is a column vector but its elements are row vectors thus making a 2D matrix.

    
    Matrix C


    Here, If is the flattened-out image column vector, thus we can represent the convolution operation: K * I = O, C x If = Of, where O is the output matrix (not the null matrix) and Of is the flattened Output matrix.

    K is a 3x3 matrix, I is a 4x4 matrix thus O is a 2x2 output; If is a 16x1 flattened vector, and C is a 4 x 16 (16 for 16 elements of I, 4 for 4 positions (2x2) in Ouput) matrix.

    Thus C x If produces a  [(4x16) x (16x1) = 4 x 1] which is of the same size as the flattened matrix, this can be reshaped to get the output.

    Why is this process more efficient?
    For an mxm Kernel and an nxn Image, instead of convoluting (n-(m-1)) * (n-(m-1)) times, we can directly compute C vector by first padding K with zeros, to get C1 and then adding a zero at start and removing from end, equivalent of iterating K along a row of I; then after n-(m-1) elements, we add n 0 values to the start and remove n 0 zero values from the end, this process is repeated to get (n-(m-1))^2 elements of C; which is then multiplied with I using highly optimized matrix multiplication algorithms.

    8) Reconstructing Images through Transposed Convolutions:

    Now we can express a convolution operator as matrix multiplication of the convolution vector with the image. Of = C x If (where ‘*’ represents the convolution operation); in general case O is of order (n-(m-1))2 * 1, C is of order (n-(m-1))2 * (n2) and If is of order n2 * 1.

    Multiplying the eqn with matrix D (of order n2 *  (n-(m-1))2) :

    DxOf = DxCxIf 

    DxC results in an n2 * n2 vector, and further multiplying with If results in an n2 * 1 vector, DxOf results in an n2 * 1 vector as well.
    Now we come to the concept of generalized inverses, for every matrix A of order p*q there exists its generalized inverse of order q*p; which may not be unique to A, and not reversible in nature but it does exist.

    So ideally we can consider D as the generalized inverse of C, where DxC = In^2 (Unit vector of Order n2), and In^2 x If = If, thus D is represented as Cg, the generalized inverse of C, the equation changes to:

    Cg x Of = If, Cg is of order n2 *  (n-(m-1))2 and Of is of order (n-(m-1))2 * 1

    Backpropagation can easily be done, by rephrasing the equation, transposing D and taking a dot with Of thus representing IfT as the dot of (Cg)T with Of and then the gradient can be calculated easily (the corresponding elements of Of are the gradients).

    So we try to learn the vector Cg, but even if we learn something close enough, we can easily reconstruct the image vector. This has an additional benefit, through multiple iterations of convolutions and transposed convolutions (also called deconvolutions), we try to converge to the value of Kernel, which itself is changing,  and thus because of the delay in converging, reach a matrix similar to Cg in terms of values, but not exactly Cg thus instead of exactly reconstructing an image, we reconstruct an image quite similar to the image, introducing inherent variation.

    Encoder-Decoder Networks and U-Nets

    Encoder-decoder networks are a fundamental architecture in the field of deep learning, widely used for tasks that involve mapping inputs to outputs, where the dimensionality of the input and output can vary. This architecture is particularly prevalent in applications like machine translation, image captioning, and sequence-to-sequence prediction tasks.

    Encoder
    The encoder part of the network takes the input data and compresses the information into a context vector (also known as a feature vector or state vector). This vector aims to encapsulate the essence of the input information in a fixed-size representation, regardless of the input size. The encoder processes the input through one or more layers (which can be fully connected layers, convolutional layers, or recurrent layers, depending on the nature of the input data) to produce this context vector.

    Decoder
    The decoder part of the network is responsible for taking the context vector generated by the encoder and translating it into the desired output format. The decoder essentially learns to generate the output data from the compressed information while potentially considering additional inputs during the generation process. Similar to the encoder, the decoder can consist of various types of layers tailored to the specific requirements of the output data.

    This can simply be a Neural Network, used to compress the data and learn its implicit representation in a highly efficient manner, and used to regenerate it when required.

    For computer vision tasks, this network is generally represented by a complex convolutional network, using convolutions and pooling/strided or even pus convolutions to compress an image and transposed convolutions to decompress it.








    This has various potential uses, it can act as a generation network, an image segmentation model, a de-noising and an upscaling network as well.

    But why to compress?
    Why do we compress the layer in narrower layers if we can just make broader and larger networks which will probably produce better outcomes.

    A network of the same number of neurons can just learn to multiply itself by one and give the input image as the output, thus the network is inherently:
    Not efficient at parameterization
    Not learning properly

    One can then say to reduce the number of pixels by one and then construct the network, but yet again that means that all parameters are more or less copying the input and one pixel is actually constructed, in other terms the network is not understanding the basis of an image, not constructing the latent space as it ideally shall construct.

    Thus there is a tradeoff, between capacity of the network to actually learn the latent representation along with its efficiency and the overall model capacity. Not only this, but there is also a tradeoff between information preserved, and that lost, because if the network is reduced by a lot, large amount of information loss is possible.

    One may argue that other than generation tasks say for segmentation models, where we are not producing the input image but rather separating different types of objects within the image. There, one may say that we won’t face the tradeoffs between representation and capacity, since model can not just copy the inputs. Although this is true to an extent it can very well be argued that the model is still not efficient at parametrization and thus not learning effectively. In simple terms a large model may produce the correct result, but it is not that process to produce the result that is correct or most ideal, ideal being subjective to what we desire, which may be a good latent space representation and an implicit understanding of the image.

    A different technique to use larger networks for generation tasks could be by using what we call as de-noising auto encoders. These add noise to the input and try outputin the denoised image, thus turning it into a denoising task. As mentioned above for segmentation tasks, since the model can not directly copy the image, these models will learn the representation relatively better for larger networks.




    U-Nets
    Inspired from its shape, U-Nets are a type of convolutional neural network (CNN) architecture that resembles the letter "U". They were originally designed for biomedical image segmentation tasks but have since found applications in various areas requiring precise and detailed pixel-level predictions, such as satellite image analysis, autonomous vehicle perception, and even art style transfer. The U-Net architecture is particularly notable for its effectiveness in working with a small amount of data, a common scenario in medical imaging.



    This model includes convolutional layers, and pooling layers for encoding and transposed/deconvolutional and skip connection layers for decoding information.

    The main challenge faced by a standard encoder-decoder network is to overcome the loss of information. A distinctive feature of U-Nets is the use of skip connections that directly concatenate feature maps from the encoder to the corresponding layers in the decoder. These connections provide the decoder with detailed local information from the input image, which, when combined with the global context acquired during downsampling, allows for more precise segmentation. 

    Skip connections can be implemented by cropping the images to the desired size and then concatenating the channels with the upsampled channels.

    A potential downside for skip connections is that it can directly learn to use the values from skip connections and not use the encoded information, thus making the whole network useless, as discussed in the previous section for large networks.

    There are various mechanisms to counter this such as noising and denoising, data augmentation, and regularization using dropout layers, which drop out some neurons essentially forcing the model to learn representation throughout and also initially freezing and gradually unfreezing the skip connection layers after training the model.

    The loss functions used for this model could be binary/categorical cross entropy for image segmentation tasks, but for image generation tasks we can use mean squared error loss, since it is similar to regression of a value, although other losses may very well be used depending on the application.

    GANs (Generative Adversarial Networks)
    Generative Adversarial Networks (GANs) are a groundbreaking and influential class of neural networks designed for generative modeling, a type of unsupervised learning. Introduced by Ian Goodfellow and his colleagues in 2014, GANs have revolutionized the field of artificial intelligence, particularly in tasks involving generating highly realistic images, videos, music, and even text.

    Basic Concept
    The core idea behind GANs is relatively straightforward but profound. A GAN consists of two neural networks, the Generator and the Discriminator, which are trained simultaneously through a competitive process:

    Generator (G): This network learns to generate data (e.g., images) that resemble the real data. Its goal is to produce outputs indistinguishable from genuine data to the extent that the Discriminator cannot reliably tell the difference.

    Discriminator (D): In contrast, the Discriminator learns to distinguish between the real data (from the training dataset) and the fake data produced by the Generator. Essentially, it acts as a critic that gets better and better at identifying what's real and what's not.

    Training Process
    Training a GAN involves a delicate balance where the Generator and the Discriminator improve in tandem through an adversarial process, often described as a "minimax" game. Here's a simplified overview of the steps involved:

    Training the Discriminator: Initially, the Discriminator is trained with a batch of data containing both real and fake images (generated by the Generator). The goal is to maximize its ability to correctly label the images as real or fake.

    Training the Generator: Next, the Generator is trained to fool the Discriminator. The Generator's output is fed to the Discriminator, and the Generator is updated based on how well the Discriminator was able to distinguish the fake data from the real data. The objective is to minimize the Discriminator's accuracy, thereby improving the Generator's ability to produce realistic data.

    Iterative Improvement: This process is repeated in numerous iterations, with both networks improving over time. The Generator learns to produce increasingly realistic data, while the Discriminator becomes better at distinguishing real from fake.


    As studied earlier, the discriminator could be a standard CNN model that is trained to distinguish between real images and fake images as generated by the generator.


    Two types of losses are calculated, the generator loss and the discriminator loss, first a discriminator loss is calculated, and the discriminator is trained, and then the generator is trained.

    As stated, we are in a minimax game and trying to minimize the losses.

    We can write the loss as a function:

    V(D,G) = Ex~Pdata(x) [log D(x)] + Ez~Pnoise(z) [log (1 - D(G(z))]

    Here V is the loss for the model, E stands for expectation, first part is the discriminator loss independent of the generator x~Pdata(x) shows that the input is sampled from the input space distribution and its log loss is calculated whereas the second part which represents the discriminator loss dependant of the generator z~Pnoise(z) is the noise sampled from the distribution of noise (generally gaussian distribution), and its log loss is calculated, considering the fact that (y = 0) for the generator for this case.

    For the discriminator, we want to maximize this since we want it to be the most accurate. (i.e value to be near 0). Whereas for the generator we want it to generate realistic output and thus we want to maximize the error of the discriminator. (i.e. value to be highly negative).

    Now we have two options, one is to maximize the discriminator loss D and then minimize the generator loss G and the other to minimize G and then maximize D:

    minGmaxD V(D,G)= Ex~Pdata(x)[log D(x)] +Ez~Pnoise(z)[log (1 - D(G(z))]
    maxDminG V(D,G)= Ex~Pdata(x)[log D(x)] +Ez~Pnoise(z)[log (1 - D(G(z))]

    Minimising G could directly lead to inverse training of discriminator, and not necessarily good generation capabilities, which then paired with maximizing could just mean neutralizing the effects and not guarantee convergence.

    Whereas maximizing D would first train the generator to understand the difference between real and fake images and then training the generator would lead to enhanced generation capabilities. This process when re-iterated multiple times would ideally lead to infinite capabilities for the generator where the discriminator cannot differentiate between real and fake.

    This iterative process can be implemented by first training the discriminator model on one epoch then the generator and so on. (one epoch through mini-batch gradient descent)

    Note: Taking multiple steps towards convergence for the discriminator model, as in mini-batch gradient descent or running a few epochs at a time is a better option, considering the fact that the discriminator would actually have a sense of what is real and what is fake after a few steps as compared to one and generator can actually improve as compared to worsening the discriminator* and so that we do not again run into the non-convergence issue mentioned above.

    *This concept is understandable because an improved discriminator indicates that the entire model is nearing convergence, suggesting that further enhancements to the generator are likely to bring the model closer to convergence. Conversely, with an average discriminator, which suggests that the model is not yet in a convergence zone, training the generator could actually deteriorate the discriminator's performance. This is because there are numerous ways for the discriminator to worsen, making it a more probable outcome. 



    Generative Process:
    Random noise is sampled from a pre-set distribution and a generator network, could be convolutional layers, dense layers, auto-encoders, or even U-Nets are used. 

    Generally, the random noise given as input is smaller then the output size and is present just to induce variation in the generations, and could come from the motivation that while drawing random thoughts, events influence it even for humans.

    A model trained with this architecture will only produce one type of output, but if we want it to be trained on multiple types, then we can use an image/text embedding as input for the generator as well.

    The reason why GANs are inherently better than U-Nets, CNNs, and other generative models is their ability to not train on direct losses but on understanding the implicit representations of both the generative and discriminative part and use them as adversarial of one another to generate images that realistic instead of images that match the output. As stated before, provided enough computing power it could ideally lead to infinite generation capabilities. 

    For normal generative models, the process of learning on custom-defined loss functions and matching on input images from a given dataset often leads to low sharpness of the image and very averaged-out features, of the ones present in the training set. This problem is not present in GANs, since they allow multiple answers, as compared to other networks.

    An example of the low sharpness output for MSE models for the next video frame prediction, because of averaging out predictions from training cases. GANs can predict more accurately because their training is not based on given data loss, but on what could be a suitable prediction for the discriminator.








    Generating Deepfakes:

    But How are deepfakes generated?

    Two different encoder networks are trained, one to generate faces of A and another for B. Then the encoded versions of image A and B are switched between the networks, this results in encoder 1 generating image of B with face of A and vice versa for encoder 2.

    This encoder-decoder network could be a U-Net and the output can further be paired with a discriminator network, thus training the whole network as a GAN, further improving performance.

    This type of deepfake architectures have a problem, which is that networks with same architectures might have different way/method of final representation, i.e. different way of encoding, hence it is better to use a unified encoder network and separate decoder networks. 

    Further predictive masks can be used to highlight regions which need more attention to detail and then the process of reconstruction and can be used to guide the disciminator network on where to zero in while detecting flaws. For the reconstruction part the masked part can be fed in to a network trained to just reconstruct the facial features and expression, instead of the whole image.




    '''

In [3]:
tokeniser = Tokeniser()
tokens, n_toks = tokeniser.tokenise(text)

Unknown token marker is not present in the 'data' (token map). Adding it there. Setting its token id to 131072, token count to 0 and increasing 'max_token_id' to 131072


Tokenising words: 100%|██████████| 16796/16796 [00:00<00:00, 592543.78it/s]


In [4]:
tokens[:8]

['DE', 'EP', ' ', 'FAK', 'ES', '\n', ' ', ' ']

In [5]:
n_toks, len(tokens)

(18764, 18764)

In [6]:
tokens.count(' ')

8481

In [7]:
import regex

In [31]:
from tiktoken._educational import *

In [34]:
def get_word_breaks(text):
    # Load the encoding for GPT-4
    enc = tiktoken.encoding_for_model("gpt-4o")
    
    # Encode the text to get token IDs
    token_ids = enc.encode(text)
    
    # Decode each token to get its subword string
    subwords = [enc.decode([token]) for token in token_ids]
    
    return subwords

In [35]:
gpt4o_tokens = get_word_breaks(text)

In [36]:
gpt4o_tokens[:5], len(gpt4o_tokens)

(['DE', 'EP', ' F', 'AK', 'ES'], 10403)

In [37]:
pattern = regex.compile(r'[^\p{L}\p{N}]+')
bool(pattern.fullmatch('9')), bool(pattern.fullmatch('a')), bool(pattern.fullmatch('B')), bool(pattern.fullmatch(' ')), \
bool(pattern.fullmatch('/'))

(False, False, False, True, True)

In [38]:
alpha_tokens = 0
for i in gpt4o_tokens:
    for j in i:
        if not bool(pattern.fullmatch(j)):
            alpha_tokens += 1
            break
alpha_tokens

7964

In [39]:
count = 0
for i in tokens:
    if len(regex.split(r'[^\p{L}\p{N}_]+', i)) > 1:
        continue
    count += 1
count

7827

In [7]:
token_ids = tokeniser.token_ids(tokens)

In [8]:
token_ids[:10]

[46746, 29659, 29659, 67133, 131072, 118813, 29659, 111880, 92770, 115964]

In [9]:
tokeniser.max_token_id

131072

In [10]:
one_hot_np = tokeniser.one_hot_tokens(token_ids, op='np')

In [11]:
one_hot_np

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [12]:
one_hot_np.shape

(26319, 131073)

In [13]:
type(one_hot_np)

numpy.ndarray

In [14]:
del one_hot_np

In [15]:
one_hot_torch = tokeniser.one_hot_tokens(token_ids, op='torch')

In [16]:
one_hot_torch.shape

torch.Size([26319, 131073])

In [17]:
one_hot_torch

tensor([[False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        ...,
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False],
        [False, False, False,  ..., False, False, False]])

In [18]:
type(one_hot_torch)

torch.Tensor

In [19]:
tokeniser.visualise_tokens(tokens)

'
',' ',' ','Ent','<|unknown_token|>','ity',' ','Normal','isa','tionIn',' ','Ind','ic',' ','Languages','
','
','
','
','
','
','
','
','
','
','
',' ',' ','Tas','may',' ','Pan','kaj',' ','Ti','bre','wal','
',' ',' ','1','.',' ','Introduction','
',' ',' ','Entity',' ','normalization',' ','is',' ','central',' ','to',' ','many',' ','NLP',' ','tasks','.',' ','In',' ','Ind','ic',' ','languages',',',' ','the',' ','challenge',' ','amp','lifies',' ','because',' ','we',' ','must',' ','handle',' ','multiple',' ','scripts',' ','(','Dev','ana','gari',',',' ','Tamil',',',' ','Telugu',',',' ','etc','.),',' ','plus',' ','localized',' ','words',' ','for',' ','months',',',' ','currency',',',' ','numer','ic',' ','expansions',',',' ','etc','.',' ','Our',' ','end',' ','goal',' ','is',' ','to',' ','take',' ','sentences',' ','containing',' ','dates',',',' ','currencies',',',' ','and',' ','scientific',' ','units',' ','and',' ','produce',' ','fully',' ','spelled','-','out',' ','text',' ','in',' ','the',' ','s

In [20]:
tokeniser.visualise_token_ids(token_ids)

[46746, 29659, 29659, 67133, 131072, 118813, 29659, 111880, 92770, 115964, 29659, 18456, 76330, 29659, 66911, 46746, 46746, 46746, 46746, 46746, 
46746, 46746, 46746, 46746, 46746, 46746, 29659, 29659, 113489, 95769, 29659, 130643, 88847, 29659, 81009, 8901, 71026, 46746, 29659, 29659, 
48368, 84422, 29659, 50865, 46746, 29659, 29659, 117538, 29659, 103382, 29659, 44405, 29659, 29896, 29659, 27541, 29659, 122612, 29659, 22817, 
29659, 117264, 84422, 29659, 129289, 29659, 18456, 76330, 29659, 103028, 23354, 29659, 24487, 29659, 50871, 29659, 45484, 50416, 29659, 103839, 
29659, 97661, 29659, 92364, 29659, 83920, 29659, 902, 29659, 22054, 29659, 37143, 36093, 97424, 107595, 23354, 29659, 49746, 23354, 29659, 
43446, 23354, 29659, 101053, 19698, 29659, 6973, 29659, 59932, 29659, 59604, 29659, 81638, 29659, 38023, 23354, 29659, 112669, 23354, 29659, 
44717, 76330, 29659, 71390, 23354, 29659, 101053, 84422, 29659, 24429, 29659, 126869, 29659, 13653, 29659, 44405, 29659, 27541, 29659, 16699,