<a href="https://colab.research.google.com/github/AJAkil/Classical-Autocorrect/blob/main/Autocorrect_minimum_edit_distance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import the necessary libraries

In [1]:
import re
from collections import Counter
import numpy as np
import pandas as pd

# Upload the required file to the Colab environment

In [2]:
from google.colab import files
uploaded = files.upload()

Saving shakespeare.txt to shakespeare.txt


# Write the methods needed to analyze the text

## First things first: we need to read the file and process the text. We find all the words in the file with a regular expression and convert them to lowercase, building up a word list. The set of unique words in this list will act as our vocabulary for further analysis.

In [7]:
def read_and_process_data(file_name):
    """
    Input: 
        file_name: the name of a text file found in your current directory.
    Output: 
        words: a list containing all the words in the corpus (text file you read) in lower case. 
    """
    words = []

    # Open the file that was passed in (read-only) instead of a hardcoded name
    with open(file_name, 'r') as file:
        for line in file:
            # \w+ matches runs of letters, digits, and underscores
            words.extend(re.findall(r'\w+', line.lower()))

    return words

In [9]:
processed_words = read_and_process_data('shakespeare.txt')
vocab = set(processed_words)  
print(f"The first 15 words in the text file are: \n {processed_words[0:15]} ")
print(f"There are {len(vocab)} unique words in the vocabulary.")

The first 15 words in the text file are: 
 ['o', 'for', 'a', 'muse', 'of', 'fire', 'that', 'would', 'ascend', 'the', 'brightest', 'heaven', 'of', 'invention', 'a'] 
There are 6116 unique words in the vocabulary.
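
To see what the `\w+` pattern does to a single line, here is a minimal sketch. The sample strings are hypothetical and chosen only for illustration. Note that `\w` does not match an apostrophe, so a contraction is split into two tokens.

```python
import re

# Hypothetical sample line, used only to illustrate the tokenizer.
sample_line = "O for a Muse of fire, that would ascend!"

# Punctuation and whitespace are not matched by \w, so they act as separators.
sample_tokens = re.findall(r'\w+', sample_line.lower())
print(sample_tokens)
# ['o', 'for', 'a', 'muse', 'of', 'fire', 'that', 'would', 'ascend']

# The apostrophe splits a contraction into two separate tokens.
contraction_tokens = re.findall(r'\w+', "Don't".lower())
print(contraction_tokens)
# ['don', 't']
```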


## We write a method that gives us the frequency of each word in the text. This is done using the Counter class from Python's collections module, which takes the list of processed words, counts how many times each one occurs, and gives us a dictionary containing that information.

In [14]:
def get_count(processed_words):
    '''
    Input:
        processed_words: a list of words representing the corpus that has been processed.
    Output:
        A dictionary where key is the word and value is its frequency (number of times it has occurred in the corpus).
    '''

    # Counter does the counting; dict() converts the result to a plain dictionary
    return dict(Counter(processed_words))

In [24]:
word_freq_dict = get_count(processed_words)
print(f"There are {len(word_freq_dict)} key-value pairs in word_freq_dict")

for k, v in list(word_freq_dict.items())[:5]:
  print(f"The count/frequency for the word {k} is: {v}")

There are 6116 key-value pairs in word_freq_dict
The count/frequency for the word o is: 157
The count/frequency for the word for is: 474
The count/frequency for the word a is: 757
The count/frequency for the word muse is: 18
The count/frequency for the word of is: 1094
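
As a small aside, `Counter` also offers a couple of conveniences that a plain dictionary lacks. The sketch below uses a hypothetical toy word list (not the Shakespeare corpus) to show `most_common` and the zero count returned for unseen words.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
toy_words = ['of', 'fire', 'of', 'a', 'muse', 'of', 'a']
toy_counts = Counter(toy_words)

# most_common(n) returns the n most frequent (word, count) pairs.
top_two = toy_counts.most_common(2)
print(top_two)
# [('of', 3), ('a', 2)]

# Unlike a plain dict, a Counter returns 0 for a word it has never seen.
missing_count = toy_counts['zebra']
print(missing_count)
# 0
```

Once the result is converted to a plain `dict`, as `get_count` does, looking up an unseen word raises `KeyError`, so `dict.get(word, 0)` is the safer lookup there.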


## Next we calculate the probability of each word in the corpus. To do this, we simply divide the number of times the word has occurred in the corpus (i.e. its frequency) by the total number of words in the corpus.

In [53]:
def get_probability(word_freq_dict):
    '''
    Input:
        word_freq_dict: The word frequency dictionary where key is the word and value is its frequency.
    Output:
        probs: A dictionary where keys are the words and the values are the probability that a word will occur. 
    '''

    # Total number of words in the corpus (sum of all frequencies)
    total_word_count = sum(word_freq_dict.values())
    return {key: value / total_word_count for key, value in word_freq_dict.items()}

In [55]:
probabilities = get_probability(word_freq_dict)
print(f"Length of probs is {len(probabilities)}")

for k, v in list(probabilities.items())[:5]:
  print(f"The probability of the word {k} is: {v}")

Length of probs is 6116
The probability of the word o is: 0.0029283396127877045
The probability of the word for is: 0.008840974372365426
The probability of the word a is: 0.01411944641325027
The probability of the word muse is: 0.000335733204013877
The probability of the word of is: 0.020405118066176745
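
A quick sanity check on this step: because every count is divided by the same total, the probabilities should sum to 1. The sketch below repeats the calculation on a hypothetical toy frequency dictionary, so the numbers are illustrative only and are not taken from the corpus.

```python
import math

# Hypothetical toy frequencies, used only for illustration.
toy_freq = {'of': 3, 'a': 2, 'fire': 1, 'muse': 1}

# Total number of word occurrences: 3 + 2 + 1 + 1 = 7
toy_total = sum(toy_freq.values())

# P(word) = count(word) / total number of words
toy_probs = {word: count / toy_total for word, count in toy_freq.items()}

# Floating-point division can leave a tiny rounding error,
# so compare with a tolerance rather than with ==.
probs_sum_to_one = math.isclose(sum(toy_probs.values()), 1.0)
print(probs_sum_to_one)
```

The same check can be run on the real data with `math.isclose(sum(probabilities.values()), 1.0)`.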
