# Part 4: New Model & Method as a better design for an improved sentiment analysis system

### Explaining our approach:

##### We will opt to use the max-marginal forward-backward algorithm for this part, over the Viterbi algorithm to leverage its ability to provide detailed probabilistic insights into each state's likelihood at every time step, rather than merely identifying the most likely overall sequence of states. 

#####  While both algorithms are probabilistic, the Viterbi algorithm provides a deterministic output (the single best sequence), based on probabilistic calculations. In contrast, the max-marginal forward-backward algorithm offers a broader probabilistic perspective by detailing the likelihood of each possible state at every point in the sequence, making it more suitable for applications where such detailed probabilistic information is valuable. Another key thing to note is, that the scores in the max_marginal algotrithm, actually calculateds the probability of each state, independently of the path decisions, giving a more comprehensive view of state probabilities considering all poossible paths, not just the most likely path. In the case of this project, speech tagging, the max-marginal forward backward decoding is better as with its probabilistic information, it can generally help with handling ambiguity and model confidence making it "better" than the Viterbi Algorithm.

##### To address the potential concern, that this max_marginal forward-backward algorithm might not yield better results than the Viterbi implementation we had earlier on, we couple our max_marginal forward-backward algorithm, with a new and better way to handlw unknown words, which will give us more accurate emission parameters. In this way, we will be able to yield a more accurate sequence of tags compared to part 3. And this would also have the added benefit of being able to indicate how certain we are about each state, which can be very useful sometimes.

##### We also would like to take this chance to try implementing a model other than the Viterbi Algorithm

### New model used: Max-Marginal Forward-Backward

##### We make use of the max-marginal decoding algorithm that employs the forward-backward method to calculate individual state probabilities at each time step in a sequence, by computing forward (α) and backward (β) probabilities. It selects the most likely state for each position by identifying the state that maximizes the product of these probabilities (αu(i) * βu(i)). 

### New method to smooth emission parameters: Absolute Discounting

##### Absolute discounting is a smoothing technique used in language modeling to address data sparsity issues by adjusting the counts of observed events. It works by subtracting a fixed discount factor from the counts, ensuring that no count becomes negative. This redistribution of probability mass from observed to unseen events helps the model generalize better, particularly in the presence of rare or unseen words. By allocating a portion of probability mass to unseen events, absolute discounting improves the model's ability to make accurate predictions, even in scenarios where training data is limited or incomplete. Overall, absolute discounting enhances the robustness and performance of language models by providing more reliable probability estimates and mitigating the impact of data sparsity.


### Processing The File

In [1]:
def process_file(filepath):
    # we make use of the default library "collections" to make processing the tags and word-tag pairs easier
    import collections #used for counting
    tag_count = collections.defaultdict(int)  # counting for tags
    word_tag_count = collections.defaultdict(int)  # counting for word-tag pairs
    vocabulary = set()  # stores unique words
    sentences = [] # store all the sentences
    current_sentence = []

    with open(filepath, 'r', encoding='utf-8') as file:
        # reading file line-by-line
        for line in file:
            stripped_line = line.strip() #removes the /n and then splits it to separete the word and its label
            if stripped_line:  # check if there even is a word or tag in the line
                word, tag = stripped_line.split()  # Split line into word and tag
                word_tag_count[(word, tag)] += 1
                tag_count[tag] += 1
                vocabulary.add(word) #doesnt add duplicates
                current_sentence.append(word)
            else:
                if current_sentence: 
                    # add current sentence to sentences then restart the count
                    sentences.append(current_sentence)
                    current_sentence = []
        if current_sentence:
            sentences.append(current_sentence)

    return tag_count, word_tag_count, vocabulary, sentences


#tag count : dictionary with the count of each tag e.g ('B-NP') : 45
#word_tag_count : dictionary with the count of each word-tag pair e.g ('Municipal','B-NP') : 1

tag_count, word_tag_count, vocabulary, sentences = process_file('EN/train')


In [19]:
def process_file_for_transitions(filepath):
    # we make use of the default library "collections" to make processing the tags and word-tag pairs easier
    import collections
    transition_count = collections.defaultdict(int) #y_u to y_v, including start and stop
    tag_count = collections.defaultdict(int)  # counting for tags
    vocab = set()
    
    # we still need counters for stop and start, to add them into the transition parameters
    start_counter = 0 
    stop_counter = 0
    
    START = "START"
    STOP = "STOP"
    previous_tag = START

    with open(filepath, 'r', encoding='utf-8') as file:
        for line in file:
            stripped_line = line.strip()
            if stripped_line:
                word, tag = stripped_line.split()
                transition_count[(previous_tag, tag)] += 1
                if previous_tag == "START":
                    start_counter += 1
                tag_count[tag] += 1
                previous_tag = tag
                vocab.add(word)
            else:  # when the sentence has ended
                transition_count[(previous_tag, STOP)] += 1
                stop_counter += 1
                previous_tag = START  # reset for the next sentence
     #adding counts for start and stop
    tag_count["START"] = start_counter
    tag_count["STOP"] = stop_counter
    
    with open(filepath, 'r', encoding='utf-8') as file:
        content = file.read().strip()
    # split on double newlines which denote separated sentences in our case
    sentences = [sentence.split() for sentence in content.split('\n\n')]

    return transition_count, tag_count, sentences, vocab

def estimate_all_transition_probability(transition_count, tag_count):
  
    transition_probabilities = {}
    # iterate through all the transition tag pairs to get all the transition probabilities
    # store the results in the dictionary transition_probabilities
    for (y_u, y_v), count in transition_count.items():
        transition_probabilities[(y_u, y_v)] = count / tag_count[y_u]
        
    return transition_probabilities


# run the function to get all the transiiton probaibilities
transition_count, tag_count, sentences, vocabulary = process_file_for_transitions('EN/train')
transition_probabilities = estimate_all_transition_probability(transition_count, tag_count)

### Function to write outputs to a file

In [18]:
def get_sentences(filepath):
    with open(filepath, 'r', encoding='utf-8') as file:
        content = file.read().strip()
    # split on double newlines which denote separated sentences in our case
    return [sentence.split() for sentence in content.split('\n\n')]

def get_prediction(filepath, tag_count, transmission_probabilities, emission_probabilities, vocabulary):
    sentences = get_sentences(filepath)
#     print(sentences)
    predictions = []
    for sentence in sentences:
        best_path_prediction = forward_backward_decoding(sentence, tag_count, transmission_probabilities, emission_probabilities, vocabulary)
        predictions.append(list(zip(sentence, best_path_prediction))) #puts them in the predictions array pairwise
    return predictions
    

In [24]:
def write_tag_predictions_to_file(predictions, output_filepath):
    # open the output file for writing
    with open(output_filepath, 'w', encoding='utf-8') as file:
        for sentence in predictions:
            for word, tag in sentence:
                # write each word and its predicted tag to the file, with a spacing to separate.
                file.write(f"{word} {tag}\n")
            file.write("\n")

### 4a, 15 points (New method and New model)

In [48]:
def estimate_emission_probabilities_absolute_discounting(tag_count, word_tag_count, d=0.97):
    emission_probabilities = {}
    total_unique_words = sum(tag_count.values())

    for (word, tag), count in word_tag_count.items():
        adjusted_count = max(count - d, 0)
        emission_probabilities[(word, tag)] = adjusted_count / tag_count[tag]

    # Handle unseen words by assigning the discounted probability mass
    for tag in tag_count:
        unseen_prob = d * len([word for word, t in word_tag_count if t == tag]) / total_unique_words
        emission_probabilities[("#UNK#", tag)] = unseen_prob

    return emission_probabilities


In [49]:
# run the function to get emission parameters
new_emission_probabilities = estimate_emission_probabilities_absolute_discounting(tag_count, word_tag_count)

In [30]:
def forward_backward_decoding(sentence, tag_count, transition_probabilities, new_emission_probabilities, vocabulary):
    tags = [tag for tag in tag_count if tag not in ['START', 'STOP']]  # makes a dictionary of tags that doesnt include start and stop, so we dont iterate through them unncessarily
    n = len(sentence)
    m = len(tags)
    
   #replace all unknown words with the special token #UNK# so that we can properly handle the emission probabilities
    # we want to directly alter the sentence to have '#UNK#' as that's the requirement from Part 1
    for i in range(0,n):
        if sentence[i] not in vocabulary:
            sentence[i] = '#UNK#'
    
    # initialize the arrays to store the forward and backward scores
    alpha = [[float(0) for _ in range(m)] for _ in range(n)]
    beta = [[float(0) for _ in range(m)] for _ in range(n)]
    
    # Base case for forward probabilities
    for i, tag in enumerate(tags):
        alpha[0][i] = transition_probabilities.get(('START', tag), 0) * new_emission_probabilities.get((sentence[0], tag), 0)

    # Bottom up dynamimc programming to calculate forward probabilities
    for t in range(1, n):
        for j, tag in enumerate(tags):
            sum_alpha = 0
            for i, prev_tag in enumerate(tags):
                sum_alpha += alpha[t-1][i] * transition_probabilities.get((prev_tag, tag), 0) * new_emission_probabilities.get((sentence[t], tag), 0)
            alpha[t][j] = sum_alpha

    # Base case for backward probabilities
    for i, tag in enumerate(tags):
        beta[n-1][i] = transition_probabilities.get((tag, 'STOP'), 0) * new_emission_probabilities.get((sentence[t], tag), 0)

    # Bottom up dynamimc programming to calculate backward probabilities
    for t in range(n-2, -1, -1):
        for i, tag in enumerate(tags):
            sum_beta = 0
            for j, next_tag in enumerate(tags):
                sum_beta += beta[t+1][j] * transition_probabilities.get((tag, next_tag), 0) * new_emission_probabilities.get((sentence[t+1], next_tag), 0)
            beta[t][i] = sum_beta

    # Determine the best tags by finding the maximum alpha * beta product for each position
    # essentially we are looking for argmax alpha * beta at iter u
    best_tags = []
    for t in range(n):
        # initialise max score nad best tag
        max_score = -1
        best_tag = None
        for i, tag in enumerate(tags):
            score = alpha[t][i] * beta[t][i]
            if score > max_score:
                max_score = score
                best_tag = tag
        best_tags.append(best_tag)

    return best_tags


In [50]:
prediction = get_prediction('EN/dev.in', tag_count, transition_probabilities, new_emission_probabilities, vocabulary)


In [51]:
write_tag_predictions_to_file(prediction, 'EN/dev.p4.out')

In [53]:
# evaluate the scores of the max marginal forward-backward algorithm implementation
!python3 EvalScript/evalResult.py EN/dev.out EN/dev.p4.out


#Entity in gold data: 13179
#Entity in prediction: 14816

#Correct Entity : 10691
Entity  precision: 0.7216
Entity  recall: 0.8112
Entity  F: 0.7638

#Correct Sentiment : 9734
Sentiment  precision: 0.6570
Sentiment  recall: 0.7386
Sentiment  F: 0.6954


##### We can see here that actually, this implementation of max-marginal decoding does not perform as well as the viterbi algorithm that we implemented in part 3 by a little bit. Hence, we are actually sacrificing some accuracy, but this method is still better due to its probabilistic advantages mentioned above. Overall, the trade off is better in some scenarios so we are confident in this method that we have chosen

### 4b, 10 points (Evaluation using a new test set)
