In [1]:
%pip install textdistance

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [10]:
import pandas as pd
import numpy as np
import textdistance
import re
from collections import Counter
import os

# List of all text files you want to process
file_paths = ['Vocabulary/book.txt', 'Vocabulary/alice_in_wonderland.txt', 'Vocabulary/big.txt', 'Vocabulary/shakespeare.txt']
words = []

# Loop through each file in the file_paths list
for file_path in file_paths:
    with open(file_path, 'r') as f:
        # Read all the data from the text file
        file_name_data = f.read()
        # Convert all the content to lowercase
        file_name_data = file_name_data.lower()
        # Find all the alphanumeric words
        words += re.findall(r'\w+', file_name_data)  # Use r'\w+' to match entire words

# This is our unique vocabulary
V = set(words)
print(f"First ten words in the text are: {words[0:10]}")
print(f"Total unique words are {len(V)}.")

First ten words in the text are: ['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or', 'the', 'whale']
Total unique words are 39168.


The code above builds a list of every word in the corpus. Next, we count how often each word occurs, which is easy to do with the `Counter` class from Python's `collections` module:

In [11]:
# Count the frequency of each word in the words list
word_freq = Counter(words)
# Show the ten most common words
print(word_freq.most_common(10))

[('the', 97681), ('of', 48374), ('and', 47095), ('to', 35397), ('in', 27415), ('a', 27312), ('that', 16693), ('he', 14776), ('it', 14270), ('was', 13572)]


## Relative Frequency of words 

Next, we compute the probability of occurrence of each word in the corpus, which equals its relative frequency (the word's count divided by the total number of words):

In [12]:
probs = {}     
Total = sum(word_freq.values())    
for k in word_freq.keys():
    probs[k] = word_freq[k]/Total
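
As a quick sanity check of the computation above, here is a minimal sketch on a hypothetical toy word list (`toy_words` and the other `toy_*` names are illustrative and not part of the notebook). It shows that each relative frequency is a count divided by the total, and that the values sum to 1.

```python
from collections import Counter

# Hypothetical toy corpus, standing in for the `words` list built earlier
toy_words = ["the", "cat", "sat", "on", "the", "mat"]

toy_freq = Counter(toy_words)
toy_total = sum(toy_freq.values())

# Relative frequency = count of the word / total number of tokens
toy_probs = {word: count / toy_total for word, count in toy_freq.items()}

print(toy_probs["the"])         # "the" occurs 2 times out of 6 tokens
print(sum(toy_probs.values()))  # the probabilities should sum to (about) 1
```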

## Finding Similar Words

In [13]:
def my_autocorrect(input_word):
    # Convert the input to lowercase, to match the lowercased vocabulary
    input_word = input_word.lower()
    # If the word is already in the vocabulary, treat it as correctly spelled
    if input_word in V:
        return 'Your word seems to be correct'
    else:
        # Jaccard similarity on bigrams between the input and every vocabulary word
        # (explained in the markdown below)
        sim = [1-(textdistance.Jaccard(qval=2).distance(v,input_word)) for v in word_freq.keys()]
        # Build a DataFrame from the probs dict (same key order as word_freq)
        df = pd.DataFrame.from_dict(probs, orient='index').reset_index()
        df = df.rename(columns={'index':'Word', 0:'Prob'})
        # Add the Jaccard similarities as a new column
        df['Similarity'] = sim
        # Sort by similarity first, then by probability, and keep the top five
        output = df.sort_values(['Similarity', 'Prob'], ascending=False).head()
        return output

In [23]:
my_autocorrect('diffference')

| | Word | Prob | Similarity |
|---|---|---|---|
| 2186 | difference | 4.8e-05 | 0.9 |
| 8377 | differences | 1.1e-05 | 0.818182 |
| 5097 | indifference | 2.3e-05 | 0.75 |
| 1700 | different | 0.000218 | 0.636364 |
| 20870 | differed | 7e-06 | 0.545455 |


`sim = [1 - (textdistance.Jaccard(qval=2).distance(v, input_word)) for v in word_freq.keys()]`

1. This line calculates the Jaccard similarity between `input_word` and each word in the vocabulary.
2. The Jaccard distance is calculated using the `textdistance` library, with the `qval=2` argument indicating that we're considering 2-grams (bigrams, or pairs of consecutive characters) when calculating the similarity.
3. The similarity is calculated as 1 - Jaccard distance to convert the distance to a similarity score. The smaller the distance, the more similar the words are.
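
To make the scores in the output above concrete, here is a minimal pure-Python sketch of a bigram Jaccard similarity. It is not the `textdistance` implementation. The helper names are hypothetical, and it assumes repeated bigrams are counted with multiplicity, which is consistent with the 0.9 reported for `difference` (a set-based count would give 1.0, since both words contain the same distinct bigrams).

```python
from collections import Counter

def bigram_counts(word):
    # Count each pair of consecutive characters; repeated pairs are kept
    return Counter(word[i:i + 2] for i in range(len(word) - 1))

def jaccard_similarity(a, b):
    # Hypothetical helper: Jaccard similarity on bigram multisets
    ca, cb = bigram_counts(a), bigram_counts(b)
    shared = sum((ca & cb).values())  # minimum count per bigram
    total = sum((ca | cb).values())   # maximum count per bigram
    # Assumption of this sketch: words with no bigrams score 0.0
    return shared / total if total else 0.0

print(round(jaccard_similarity("difference", "diffference"), 6))   # 0.9
print(round(jaccard_similarity("differences", "diffference"), 6))  # 0.818182
```

Here `diffference` contributes 10 bigrams (with `'ff'` twice) and `difference` contributes 9, so they share 9 out of 10.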


With `qval=2`, the Jaccard distance is computed over **2-grams** (or **bigrams**), which are sequences of **two consecutive characters** within a word.

### What is a 2-gram (bigram)?
- A **2-gram** (or **bigram**) is a pair of consecutive characters in a word. For example, the word `"dog"` would have the following bigrams:
  - **'do'**
  - **'og'**

In the case of the textdistance library, setting `qval=2` means that the similarity between two words will be computed based on these pairs of consecutive characters (2-grams).
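
Extracting bigrams from a word can be sketched in one line of Python. The function name `char_bigrams` is an illustrative choice, not something provided by `textdistance`.

```python
def char_bigrams(word):
    # Slide a window of width 2 across the word, one character at a time
    return [word[i:i + 2] for i in range(len(word) - 1)]

print(char_bigrams("dog"))               # ['do', 'og']
print(len(char_bigrams("diffference")))  # 10 (an n-letter word has n - 1 bigrams)
```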

### Jaccard Similarity and Distance:
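The **Jaccard similarity** measures how similar two sets are. It is defined as the size of the intersection of the two sets divided by the size of their union. When applied to bigrams, the similarity is calculated as:

\[
\text{Jaccard Similarity} = \frac{\text{Number of bigrams shared by both words}}{\text{Number of bigrams appearing in either word}}
\]

The scores in the output above indicate that repeated bigrams are counted with multiplicity rather than once: `diffference` contains `'ff'` twice and `difference` only once, so the two words share 9 bigrams out of 10, giving 0.9 rather than 1.0.

- **Jaccard Distance** is simply \( 1 - \text{Jaccard Similarity} \).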

For example:
- Let's compare the words `"dog"` and `"dot"`.
  - The **bigrams for "dog"** are:
    - 'do', 'og'
  - The **bigrams for "dot"** are:
    - 'do', 'ot'
  
  The only common bigram between `"dog"` and `"dot"` is `'do'`, and there are 3 distinct bigrams in total ('do', 'og', 'ot'), so the Jaccard similarity is:

  \[
  \text{Jaccard Similarity} = \frac{1}{3} \approx 0.33
  \]
  Therefore, the Jaccard distance is:

  \[
  \text{Jaccard Distance} = 1 - \frac{1}{3} \approx 0.67
  \]
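
The worked example above can be checked with plain Python sets. This is a small sketch with illustrative names. Because neither word repeats a bigram, sets are enough here.

```python
def bigram_set(word):
    # Distinct pairs of consecutive characters in the word
    return {word[i:i + 2] for i in range(len(word) - 1)}

dog, dot = bigram_set("dog"), bigram_set("dot")
common = dog & dot    # bigrams shared by both words
combined = dog | dot  # bigrams appearing in either word

similarity = len(common) / len(combined)
distance = 1 - similarity

print(sorted(common))    # ['do']
print(sorted(combined))  # ['do', 'og', 'ot']
print(round(similarity, 2), round(distance, 2))  # 0.33 0.67
```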

### Why use 2-grams?
- **2-grams (bigrams)** record which characters appear next to each other, so they capture the local order of letters in a word and not just which letters occur. This makes them well suited to detecting typos and near-identical spellings.
- For example, the words `"dog"` and `"dot"` are similar due to their common 2-gram `'do'`, even though the last characters differ.
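
To illustrate why character pairs carry more information than single characters, the sketch below compares `"dog"` with `"god"` using 1-grams and 2-grams. The helpers `qgram_set` and `set_jaccard` are hypothetical, set-based simplifications written for this example.

```python
def qgram_set(word, q):
    # Distinct substrings of length q
    return {word[i:i + q] for i in range(len(word) - q + 1)}

def set_jaccard(a, b, q):
    sa, sb = qgram_set(a, q), qgram_set(b, q)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

print(set_jaccard("dog", "god", q=1))  # 1.0: same letters, order ignored
print(set_jaccard("dog", "god", q=2))  # 0.0: no shared character pairs
print(round(set_jaccard("dog", "dot", q=2), 2))  # 0.33: shares 'do'
```

With single characters, an anagram looks identical to the original word. With bigrams, only words that share adjacent letter pairs score highly.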

### Summary:
- **qval=2** in the `textdistance` library means that the function will compute the similarity between words based on **pairs of consecutive characters** (2-grams).
- This method helps to find words that are similar in terms of character sequence, even if they have small spelling differences, which is useful for autocorrection.
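
Putting the pieces together, the ranking step of `my_autocorrect` can be sketched without pandas or `textdistance`. This is an illustrative approximation, not the notebook's implementation: `suggest` and `jaccard_similarity` are hypothetical helpers, repeated bigrams are assumed to count with multiplicity, and the small `candidate_probs` dictionary reuses the five probabilities from the output table above instead of the full vocabulary.

```python
from collections import Counter

def jaccard_similarity(a, b):
    # Jaccard similarity on bigram multisets (assumption of this sketch)
    ca = Counter(a[i:i + 2] for i in range(len(a) - 1))
    cb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum((ca | cb).values())
    return sum((ca & cb).values()) / total if total else 0.0

# Probabilities copied from the output table above
candidate_probs = {
    "different": 0.000218,
    "differed": 7e-06,
    "difference": 4.8e-05,
    "indifference": 2.3e-05,
    "differences": 1.1e-05,
}

def suggest(input_word, probs, top_n=5):
    # Rank by similarity first, then by probability, both descending
    ranked = sorted(
        probs,
        key=lambda w: (jaccard_similarity(w, input_word), probs[w]),
        reverse=True,
    )
    return ranked[:top_n]

print(suggest("diffference", candidate_probs))
```

On these five candidates the order matches the table: `difference`, `differences`, `indifference`, `different`, `differed`.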