# Part 1: Estimate Emission Parameters

### Function to process the file

##### This will help to give us the tag counts, word-tag pair counts, set of words (vocabulary) and gives a list of all the sentences in the file. These different lists and counts will be used in later parts of the project

In [1]:
def process_file(filepath):
    # we make use of the default library "collections" to make processing the tags and word-tag pairs easier
    import collections #used for counting
    tag_count = collections.defaultdict(int)  # counting for tags
    word_tag_count = collections.defaultdict(int)  # counting for word-tag pairs
    vocabulary = set()  # stores unique words
    sentences = [] # store all the sentences
    current_sentence = []

    with open(filepath, 'r', encoding='utf-8') as file:
        # reading file line-by-line
        for line in file:
            stripped_line = line.strip() #removes the /n and then splits it to separete the word and its label
            if stripped_line:  # check if there even is a word or tag in the line
                word, tag = stripped_line.split()  # Split line into word and tag
                word_tag_count[(word, tag)] += 1
                tag_count[tag] += 1
                vocabulary.add(word) #doesnt add duplicates
                current_sentence.append(word)
            else:
                if current_sentence: 
                    # add current sentence to sentences then restart the count
                    sentences.append(current_sentence)
                    current_sentence = []
        if current_sentence:
            sentences.append(current_sentence)

    return tag_count, word_tag_count, vocabulary, sentences


#tag count : dictionary with the count of each tag e.g ('B-NP') : 45
#word_tag_count : dictionary with the count of each word-tag pair e.g ('Municipal','B-NP') : 1

tag_count, word_tag_count, vocabulary, sentences = process_file('EN/train')


### Function to write the output file

##### This function will write in the predictions to a specified file path, leaving a new line between sentences

In [11]:
def write_predictions_to_file(predictions, output_filepath):
    # to use this function effectively, "predictions" should be a list of sentences
    
    # open the output file for writing
    with open(output_filepath, 'w', encoding='utf-8') as file:
        for sentence in predictions:
            for word, tag in sentence:
                # write each word and its predicted tag to the file, with a spacing to separate.
                file.write(f"{word} {tag}\n")
            # add a new line, to separate sentences
            file.write("\n")

## Part 1a, 5 points (Function to estimate emission parameters)

###### We have a function to calculate the emission probability of a given word, based on the tag and word-tag pair counts. We also have another function to calculate the emission probability of all the different words present in the file.

###### We mainly make use of the latter in this project

In [2]:
def estimate_one_emission_probabilities(x, y, tag_count, word_tag_count):
    
    # get the total times y->x occurs
    word_tag_freq = word_tag_count.get((x, y), 0)
    # total times y appears
    tag_total_freq = tag_count.get(y, 1)
    
    return word_tag_freq / tag_total_freq


In [3]:
def estimate_all_emission_probabilities(tag_count, word_tag_count):
  
    emission_probabilities = {}
    # iterate through all the word tag pairs to get all the emission probabilities
    # store the results in the dictionary emission_probabilities
    for (word, tag), count in word_tag_count.items():
        # using the expression that utilises MLE
        emission_probabilities[(word, tag)] = count / tag_count[tag] 
        
    return emission_probabilities

In [4]:
# Run the function to store the emission probabilities in a variable
emission_probabilities = estimate_all_emission_probabilities(tag_count, word_tag_count)

## Part 1b, 10 points (Accounting for unknown word tokens)

###### Again, we have a function to calculate the emission probability of a given word, based on the tag and word-tag pair counts. We also have another function to calculate the emission probability of all the different words present in the file.

###### We mainly make use of the latter in this project

In [5]:
def estimate_emission_probability_with_unknown(tag, word, tag_count, word_tag_count, vocabulary, k=0.1):
    
    # total times y appears + k
    tag_total_freq = tag_count.get(tag, 0) + k
    
    # Check if the word was seen in the training set; if not, use the special UNK token
    # e(x|y) = k/(count(y)+k) if word token is UNK
    if word not in vocabulary:
        word = '#UNK#'
        word_tag_freq = k
        return word_tag_freq / tag_total_freq
    
    # get the total times y->x occurs
    word_tag_freq = word_tag_count.get((word, tag), 0)
  
    return word_tag_freq / tag_total_freq


In [6]:
def estimate_all_emission_probabilities_with_unknown(tag_count, word_tag_count, k =0.1):
  
    emission_probabilities = {}
    # iterate through all the word tag pairs to get all the emission probabilities
    # store the results in the dictionary emission_probabilities
    # accounts for when the word token x appears in the training set
    for (word, tag), count in word_tag_count.items():
        
        emission_probabilities[(word, tag)] = count / (tag_count[tag]+k)
    
    # add the emission probabilities for when the word token x is #UNK#
    for tag, count in tag_count.items():
        emission_probabilities[("#UNK#", tag)] = count / (tag_count[tag]+k)
    return emission_probabilities

In [None]:
# Run the function to override the prevous emission probabilities
emission_probabilities = estimate_all_emission_probabilities_with_unknown(tag_count, word_tag_count)

## Part 1c, 10 points (simple sentiment analysis)

##### We essentially just produce the tag, y* = arg max e(x|y) over all y for each word in a sequence. We come up with a function that predicts the best tag for 1 word, and another function that predicts the best tag for a sequence of words.

In [7]:
def predict_one_tag(word, emission_probabilities, vocabulary,tag_count):
    # use UNK token if the word is not in the vocabulary
    best_tag = None

    if word not in vocabulary:
        word = '#UNK#'
        best_tag = max(tag_count)
#         return '#UNK#'

    # initialise the variables to keep track of the best tag and its highest probability
    max_probability = -1  # Start with a very low probability

    # iterate over all possible tags for the word in the emission probabilities
    for (current_word, tag), probability in emission_probabilities.items():
        if current_word == word and probability > max_probability:
            max_probability = probability
            best_tag = tag

    return word, best_tag


In [8]:
# this function considers the fact that the file in filpath 
# has mutiple sentences separated by a new line
# thus it returns predictions in a list of sentences
def predict_all_tags(filepath, emission_probabilities, vocabulary, tag_count):
    predictions = []
    current_sentence = []
    with open(filepath, 'r', encoding='utf-8') as file:
        for line in file:
            word = line.strip()
            if word:  # Ensure the line is not empty
                word, best_tag = predict_one_tag(word, emission_probabilities, vocabulary, tag_count)
                current_sentence.append((word, best_tag))
            else:
                if current_sentence:
                    predictions.append(current_sentence)
                    current_sentence = []
        if current_sentence:
                    predictions.append(current_sentence)
    # what is returned here, is a list of sentences
    return predictions

In [9]:
# run the function to predict all tags in the file
predictions = predict_all_tags("EN/dev.in", emission_probabilities, vocabulary, tag_count)

In [13]:
write_predictions_to_file(predictions, "EN/dev.p1.out")

In [14]:
# evaluare the scores of the simple sentiment analysis system
!python3 EvalScript/evalResult.py EN/dev.out EN/dev.p1.out


#Entity in gold data: 13179
#Entity in prediction: 17330

#Correct Entity : 9251
Entity  precision: 0.5338
Entity  recall: 0.7020
Entity  F: 0.6064

#Correct Sentiment : 8319
Sentiment  precision: 0.4800
Sentiment  recall: 0.6312
Sentiment  F: 0.5453
