***K-grams***

A k-gram (also called an n-gram) is a sequence of k consecutive items, such as characters or words, taken from a text or other data sequence. K-grams are used in natural language processing (NLP) for tasks such as spelling correction and language modeling.

In [1]:
import nltk
from nltk.util import ngrams

In [2]:
d3 = "I do not like to  eat "

In [3]:
# Step 1: Collapse whitespace and mark word boundaries with '$'
# (split() drops the repeated and trailing spaces, so no '$$' bigram is produced)
d3New = "$" + "$".join(d3.split()) + "$"

# Step 2: Character-level bigrams (k=2)
kgrams1 = ngrams(d3New, 2)
kgramArray = ["".join(kGramTuple) for kGramTuple in kgrams1]

print("Character-level bigrams:")
print(kgramArray)

# Step 3: Word-level bigrams (k=2)
d3Ngram = d3.split()  # split sentence into words
nGramArray = ngrams(d3Ngram, 2)

print("\nWord-level bigrams:")
for n in nGramArray:
    print(n)


Character-level bigrams:
['$I', 'I$', '$d', 'do', 'o$', '$n', 'no', 'ot', 't$', '$l', 'li', 'ik', 'ke', 'e$', '$t', 'to', 'o$', '$e', 'ea', 'at', 't$']

Word-level bigrams:
('I', 'do')
('do', 'not')
('not', 'like')
('like', 'to')
('to', 'eat')
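
The same sliding-window idea works for any k and does not depend on NLTK. The sketch below is a minimal pure-Python version; the helper name `kgrams` is introduced here for illustration and is not part of the notebook or of any library.

```python
def kgrams(sequence, k):
    """Return every run of k consecutive items from a string or a list."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # A window starting at index i covers sequence[i : i + k];
    # the last valid start index is len(sequence) - k.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Character-level trigrams of a '$'-delimited word
char_trigrams = kgrams("$eat$", 3)

# Word-level trigrams of a tokenized sentence
word_trigrams = kgrams("I do not like to eat".split(), 3)

print(char_trigrams)
print(word_trigrams)
```

When the input is shorter than k, the range is empty and the function returns an empty list, which is usually the behavior you want for very short words.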


In [4]:
d1 = "I am Sam"
d2 = "sam I am"

d1new = d1.lower()
d2new = d2.lower()
# Split each sentence into a list of words
# Convert each list into a set (this removes duplicates)
# Count the elements in the intersection of the two sets
# Count the elements in the union of the two sets
# Dividing the first count by the second gives the Jaccard coefficient
d1Set = set(d1new.split())
d2Set = set(d2new.split())

j = len(d1Set.intersection(d2Set))/len(d1Set.union(d2Set))
j

1.0
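
The score above is 1.0 because sets ignore word order and both sentences contain the same three words. For spelling correction, the Jaccard coefficient is more often applied to the character k-grams of two words. The following is a minimal sketch of that idea; the helper names `char_kgrams` and `jaccard` and the candidate word list are made up for illustration.

```python
def char_kgrams(word, k=2):
    """Return the set of character k-grams of a word, with '$' boundary markers."""
    padded = "$" + word.lower() + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


query = "bord"  # a misspelled word
candidates = ["board", "border", "lord", "bread"]  # hypothetical dictionary entries

# Score every candidate against the query and sort from best to worst match
ranked = sorted(
    ((word, jaccard(char_kgrams(query), char_kgrams(word))) for word in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)

for word, score in ranked:
    print(f"{word}: {score:.3f}")
```

Candidates that share more bigrams with the query rank higher, so a spelling corrector could propose the top few entries as suggestions.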

The Levenshtein distance, also known as edit distance, is a string metric that measures the difference between two sequences of characters (strings).

It quantifies this difference as the minimum number of single-character edits required to transform one string into the other. These single-character edits can be:

*   Insertions: Adding a character.
*   Deletions: Removing a character.
*   Substitutions (or replacements): Changing one character into another.

Each of these operations is typically assigned a cost of 1. The Levenshtein distance between two words is the minimum total cost of the operations needed for the transformation.
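
To make the definition concrete, here is a minimal dynamic-programming sketch that assumes a unit cost for every operation. The function name `levenshtein` is chosen for illustration; it is not the library function used in the next cell, which should be preferred in practice because it is much faster.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions (cost 1 each)."""
    # previous[j] is the distance between the source prefix processed so far
    # and the first j characters of target. For the empty source prefix,
    # that distance is j insertions.
    previous = list(range(len(target) + 1))

    for i, s_char in enumerate(source, start=1):
        # Turning the first i characters of source into "" takes i deletions
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution_cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,                      # delete s_char
                current[j - 1] + 1,                   # insert t_char
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current

    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at any time, so memory use grows with the length of `target` rather than with the product of both lengths.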

In [5]:
pip install python-Levenshtein

Collecting python-Levenshtein
  Downloading python_levenshtein-0.27.1-py3-none-any.whl.metadata (3.7 kB)
Collecting Levenshtein==0.27.1 (from python-Levenshtein)
  Downloading levenshtein-0.27.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting rapidfuzz<4.0.0,>=3.9.0 (from Levenshtein==0.27.1->python-Levenshtein)
  Downloading rapidfuzz-3.13.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading python_levenshtein-0.27.1-py3-none-any.whl (9.4 kB)
Downloading levenshtein-0.27.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (159 kB)
Downloading rapidfuzz-3.13.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages:

In [6]:
from Levenshtein import distance
# from nltk.metrics.distance import edit_distance

d1 = "sitting"
d2 = "kicking"

distance(d1, d2)

3

The Soundex algorithm is a phonetic algorithm designed to index names by sound, rather than by spelling, primarily for English names. Its goal is to encode homophones (words that sound the same but are spelled differently) to the same representation, enabling matching despite minor spelling variations.


The algorithm generates a four-character code (one letter followed by three digits):

*   Retain the first letter of the name.
*   Map the remaining consonants to digits based on their phonetic similarity:
    *   1: B, F, P, V
    *   2: C, G, J, K, Q, S, X, Z
    *   3: D, T
    *   4: L
    *   5: M, N
    *   6: R
*   Treat the vowels (A, E, I, O, U, Y) and the letters H and W as separators that receive no digit.
*   Collapse consecutive duplicate digits (identical digits that appear next to each other) into a single digit.
*   Remove the separators.
*   Truncate or pad with zeros to reach a four-character code: if more than three digits follow the initial letter, keep only the first three; if fewer, pad with zeros until there are three.

In [7]:
def soundex(word: str) -> str:
    """
    Soundex implementation (slide version):
    1) Retain first letter (uppercase).
    2) Change A,E,I,O,U,H,W,Y -> '0'
    3) Map letters to digits:
       B,F,P,V -> 1
       C,G,J,K,Q,S,X,Z -> 2
       D,T -> 3
       L -> 4
       M,N -> 5
       R -> 6
    4) Collapse each run of consecutive duplicate digits into a single digit.
    5) Remove all zeros.
    6) Pad with trailing zeros and return first four characters: LDDD
    """
    if not word:
        return "0000"

    w = word.strip()
    if not w:
        return "0000"

    first = w[0].upper()

    groups = {
        'B': '1', 'F': '1', 'P': '1', 'V': '1',
        'C': '2', 'G': '2', 'J': '2', 'K': '2',
        'Q': '2', 'S': '2', 'X': '2', 'Z': '2',
        'D': '3', 'T': '3',
        'L': '4',
        'M': '5', 'N': '5',
        'R': '6'
    }
    zeros = set("AEIOUHWY")

    # Step 2 & 3: map to digits or '0' (skip the first letter)
    encoded = []
    for ch in w[1:].upper():
        if ch in zeros:
            encoded.append('0')
        else:
            encoded.append(groups.get(ch, ''))  # non-letters map to ''

    # Step 4: remove consecutive duplicate digits (including '0' duplicates)
    dedup = []
    last = None
    for d in encoded:
        if d == '':              # ignore characters outside A-Z
            continue
        if d != last:
            dedup.append(d)
        last = d

    # Step 5: remove zeros
    dedup_no_zeros = [d for d in dedup if d != '0']

    # Compose and pad (Step 6)
    code = first + "".join(dedup_no_zeros)
    return (code + "0000")[:4]


In [8]:
print(soundex("Thenura"))
print(soundex("Ravidu"))

T560
R130
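
To illustrate how such codes can index names by sound, the sketch below groups names that share a code. It restates the same simplified rules in compact form so that the block runs on its own; `soundex_compact`, `DIGITS`, and the sample names are introduced here for illustration only, and this simplified variant may differ from standard Soundex on some names.

```python
from collections import defaultdict
from itertools import groupby

# Consonant-to-digit table, same groups as described above
DIGITS = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}
SEPARATORS = set("AEIOUHWY")


def soundex_compact(word):
    """Simplified Soundex: first letter plus up to three digits, zero-padded."""
    word = word.strip().upper()
    if not word:
        return "0000"
    # Separators become the placeholder '0'; characters outside A-Z are skipped
    digits = [
        DIGITS.get(ch, "0")
        for ch in word[1:]
        if ch in DIGITS or ch in SEPARATORS
    ]
    # Collapse runs of identical symbols, then drop the placeholders
    collapsed = [symbol for symbol, _ in groupby(digits)]
    code = word[0] + "".join(d for d in collapsed if d != "0")
    return (code + "000")[:4]


names = ["Robert", "Rupert", "Smith", "Smyth", "Ravidu"]

# Build an index from Soundex code to the names that share it
index = defaultdict(list)
for name in names:
    index[soundex_compact(name)].append(name)

for code, group in index.items():
    print(code, group)
```

Names that land in the same bucket are candidate phonetic matches, which is how a search for one spelling can also surface the other.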


<!-- retrieving a value from a dictionary using the key -->
<!-- appending values to a list -->
<!-- iterating through the elements in a list -->
<!-- removing zeros using a list comprehension -->
<!-- joining a list into a string -->