# Task 0: Define a Naive Bayes' Classifier 

## We have
$$ P(word\,|\,class) = \frac{c(word\,in\,class)}{c(all\,words\,in\,that\,class)}$$
## What we need
$$ P(class\,|\,word) $$
$$ \rightarrow \frac{P(word\,|\,class) * P(class)}{P(word)} $$
## How to choose a class?
$$class = \arg\max_i\,P(class_i\,|\,word)$$
$$ \rightarrow \arg\max_i \, \frac{P(word\,|\,class_i) * P(class_i)}{P(word)} $$
$P(word)$ is the same for every class, so it does not change the argmax and can be dropped:
$$ \rightarrow \arg\max_i \, P(word\,|\,class_i) * P(class_i) $$
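The decision rule above can be sketched in a few lines of Python. This is only an illustration: the counts are made-up toy numbers, and `word_counts`, `class_counts` and `choose_class` are hypothetical names that do not appear in the notebook's own code.

```python
# Toy counts (illustrative assumptions, not taken from the training data)
word_counts = {
    "sports": {"goal": 8, "vote": 2},
    "politics": {"goal": 1, "vote": 9},
}
class_counts = {"sports": 6, "politics": 4}  # assumed number of documents per class


def choose_class(word):
    """Return the class that maximises P(word | class) * P(class)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for cls, counts in word_counts.items():
        # P(word | class) = c(word in class) / c(all words in that class)
        p_word_given_class = counts.get(word, 0) / sum(counts.values())
        # P(class) estimated from the share of documents in that class
        p_class = class_counts[cls] / total_docs
        scores[cls] = p_word_given_class * p_class
    return max(scores, key=scores.get)


print(choose_class("goal"))
print(choose_class("vote"))
```

$P(word)$ never appears in the code, since it is identical for every class and cannot change which class wins.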

In [6]:
# Step 1: Get Imports and Dependencies

import os
import pandas as pd
from nltk import ngrams, FreqDist
import re

corpus_files = sorted(os.listdir('./Data'))
#print(corpus_files)

# Read every line of the training file.
# (Calling f.readline() inside `for line in f` would skip every other line.)
with open('./Data/train.txt', 'r') as f:
    training_data = f.readlines()

In [50]:
# This function collects the data for the Naive Bayes' Classifier.
# The data collected includes the erroneous word, the type of error and the correction.
# The count of each error type is also calculated.
# Given the word and the error type, the corrections are also calculated.
from operator import itemgetter

wrong_word_count = dict() # {wrong_word: count}
error_count = dict() # {error_type: count}
wrong_word_error_count = dict() #{(wrong_word, error_type): count}
error_types = list()

sorted_wrong_word_count = list()
sorted_error_count = list() 
sorted_wrong_word_error_count = list() 

def get_error_counts(file):
    # wrong_word_context = dict() #{wrong_word: [(two_words_before), (two_words_after)]}
    
    for line in file:
        if " " in line:
            spl = line.split()
            word = spl[0]
            if word not in wrong_word_count:
                wrong_word_count[word] = 0
            wrong_word_count[word] += 1
            
            err = spl[-1]
            if err not in error_types:
                error_types.append(err)
                error_count[err] = 0
            error_count[err] += 1
            
            size = len(spl)
            if size > 1:
                spl[1:size-1] = [' '.join(spl[1:size-1])]
            
            if (word, err) not in wrong_word_error_count:
                wrong_word_error_count[(word, err)] = 0
            wrong_word_error_count[(word, err)] += 1
    for key, value in sorted(wrong_word_count.items(), key = itemgetter(1), reverse = True):
        #print ((key, value))
        sorted_wrong_word_count.append((key, value))
    for key, value in sorted(error_count.items(), key = itemgetter(1), reverse = True):
        sorted_error_count.append((key, value))
    for key, value in sorted(wrong_word_error_count.items(), key = itemgetter(1), reverse = True):
        sorted_wrong_word_error_count.append(((key), value))
    
    return 

def get_probabs(file):
    get_error_counts(file)
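
`get_probabs` stops after collecting the counts. One possible way to turn counts of this shape into the quantities from Task 0 is sketched below, treating the error type as the class. This is a hedged illustration rather than the notebook's implementation: `estimate_probabilities` and the toy counts are hypothetical, and no smoothing is applied.

```python
def estimate_probabilities(pair_count, class_count):
    """Turn raw counts into P(class) and P(word | class).

    pair_count:  {(word, error_type): count}
    class_count: {error_type: count}
    """
    total = sum(class_count.values())
    # P(class) = c(class) / c(all observations)
    prior = {err: n / total for err, n in class_count.items()}
    # P(word | class) = c(word in class) / c(all words in that class)
    likelihood = {
        (word, err): n / class_count[err]
        for (word, err), n in pair_count.items()
    }
    return prior, likelihood


# Made-up counts, only to show the shapes involved
toy_pair_count = {("teh", "SPELL"): 3, ("there", "SPELL"): 1, ("there", "GRAMMAR"): 4}
toy_class_count = {"SPELL": 4, "GRAMMAR": 4}

prior, likelihood = estimate_probabilities(toy_pair_count, toy_class_count)
print(prior)
print(likelihood)
```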
    

#### Skew in the Classes
* What if one class/genre has a much bigger corpus than the others?
* How to manage this skew?
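
One way to look at these questions: the prior $P(class)$ in the decision rule reflects how large each class is, so a skewed corpus pulls the argmax towards the biggest class. The sketch below uses made-up likelihoods and priors to show that effect. It then compares the result with a uniform prior, which is one possible way to neutralise the skew (an assumption for illustration, not something this notebook prescribes).

```python
# Assumed likelihoods P(word | class) for a single word (toy numbers)
likelihood = {"news": 0.02, "poetry": 0.05}

# Skewed prior: the "news" corpus is much bigger than the "poetry" corpus
skewed_prior = {"news": 0.9, "poetry": 0.1}
# Uniform prior: every class is treated as equally likely
uniform_prior = {cls: 1 / len(likelihood) for cls in likelihood}


def pick(prior):
    """Return argmax over classes of P(word | class) * P(class)."""
    return max(likelihood, key=lambda cls: likelihood[cls] * prior[cls])


print(pick(skewed_prior))   # the large class wins despite its lower likelihood
print(pick(uniform_prior))  # the likelihood alone decides
```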

### Generative: A Class of Models
* $P(Class\,|\,Data)$ is estimated by figuring out $P(Data\,|\,Class)$
* The model asks: what is the chance that I would have seen this data if it came from this particular class?
* The model therefore **generates** data, or more accurately, generates its estimate of the data. So, it's a