# Assignment 2
#CSE5NLP - Term 3 - 2023

This assignment contains a total of three problems/tasks. The mark distribution for these tasks is as follows:

* Problem 1: 20 marks
* Problem 2: 30 marks
* Problem 3: 50 marks

Write down your name and student ID below: 

Name: **Benjamin Kereopa-Yorke**

Student ID: **21340711**

Please read the following statement before getting started: 

*A key purpose of this assessment task is to test your own ability to complete the assigned tasks. Therefore, the use of ChatGPT, AI tools or chatbots with similar functionality is prohibited for this assessment task. Students who are found to be in breach of this rule will be subject to normal academic misconduct measures. Additionally, students may be engaged to provide an oral validation of their understanding of their submitted work (e.g., coding).*

## Problem 1 TF-IDF

Implement TF-IDF using Python, NumPy, Pandas and whatever text cleaning library is required.

The tf–idf is the product of two statistics, term frequency and inverse document frequency. There are various ways of determining the exact values of both statistics; you can use the following formulas.

### Term Frequency
$$tf_{t,d} = \log_{10}(count(t,d) +1)$$ 

* $tf_{t,d}$ is the (log-scaled) frequency of the term $t$ in the document $d$, where $count(t,d)$ is the number of times $t$ occurs in $d$

### Inverse Document Frequency
$$idf_t = \log_{10}(\frac{N}{df_t})$$

* $N$ is the total number of documents
* $df_t$ is the number of documents in which term $t$ occurs

### TF-IDF
$$tf\text{-}idf_{t,d} = tf_{t,d} \times idf_t $$
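
As a quick sanity check of the three formulas above, the following minimal sketch computes them by hand on a tiny corpus. The corpus and the helper names (`tf`, `idf`, `tf_idf`) are made up for illustration and are not part of the required interface.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): three tiny, already-tokenised documents
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],
    ["the", "bird", "flew"],
]
n_docs = len(docs)

# df_t: number of documents in which each term occurs at least once
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf(term, doc):
    # tf = log10(count(t, d) + 1)
    return math.log10(doc.count(term) + 1)

def idf(term):
    # idf = log10(N / df_t); only defined for terms seen in the corpus
    return math.log10(n_docs / doc_freq[term])

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("the", docs[0]))  # 0.0, because "the" occurs in every document
print(tf_idf("sat", docs[1]))  # log10(3) * log10(3/2), a small positive weight
```

A term that appears in every document gets an IDF of zero, so its TF-IDF weight vanishes however often it occurs.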

### What is expected? 
Your implementation should include the following two functions:
 * `compute_tfidf_weights(train_docs)`
 * `word_tfidf_vector(word, tf_df, idf_df)`

To revise what TF-IDF is, you can refer to the lecture notes and the further reading under Week 7.


In [15]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np

# Function to compute TF and IDF weights for a given set of documents
def compute_tfidf_weights(train_docs):
    # Count raw term occurrences (rows = documents, columns = terms).
    # CountVectorizer is used only for tokenisation and counting, so the
    # weights below follow the assignment formulas rather than sklearn's defaults.
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(train_docs).toarray()
    terms = vectorizer.get_feature_names_out()

    # Term Frequency: tf = log10(count(t, d) + 1)
    docs_tf = pd.DataFrame(np.log10(counts + 1), columns=terms)

    # Inverse Document Frequency: idf = log10(N / df_t)
    n_docs = counts.shape[0]
    doc_freq = (counts > 0).sum(axis=0)
    idf = np.log10(n_docs / doc_freq)

    # Repeat the IDF row once per document so it aligns with the TF DataFrame
    docs_idf = pd.DataFrame([idf] * n_docs, columns=terms)

    # Return the TF and IDF DataFrames
    return docs_tf, docs_idf

# Function to compute the TF-IDF vector for a given word
def word_tfidf_vector(word, tf_df, idf_df):
    # Check if the word is present in the documents
    if word in tf_df.columns:
        # Compute the TF-IDF value for the word by multiplying its TF and IDF values
        tf_idf_value = tf_df[word] * idf_df[word]
        
        # Return the TF-IDF value as a Numpy array
        return tf_idf_value.to_numpy()
    else:
        # If the word is not found in the documents, print a message and return None
        print("Word not found in training documents.")
        return None


## Problem 2 POS for classification

Robots and chat bots receive different commands to do certain tasks. 

Write a simple program that receives interactions in the form of a sentence and returns:
* A tuple of (command, object) if the sentence is a command
* None if the sentence is not a command

To write this function, you can utilize a Part-of-speech tagger or named-entity recognizer from libraries like NLTK and Spacy.

Consider the following EXAMPLE sentences:

* Commands:
  * Grab the book
  * Fetch the ball
  * Open the jar
  * Can you hand this spoon to John?

* Not commands:
  * Hey, how is it going?
  * How is your day today?
  * Do you like the weather?

This list is not exhaustive; your function should be able to handle more cases.

Expected outcome:

1. A function that performs the task
2. If your function has limitations, highlight those limitations with examples. You are not required to submit a different file. Write your answer in a 'Text' block in this notebook. 

In [16]:
# Import necessary library
import nltk

# Download necessary NLTK data
nltk.download('punkt') # This line downloads the 'punkt' package, which includes a pre-trained Punkt tokenizer for several languages. This is used for word tokenization.
nltk.download('averaged_perceptron_tagger') # This line downloads the averaged perceptron tagger that is pre-trained on English news text. It is used for part-of-speech tagging.

# Define a function to extract command from a sentence
def extract_command():
    '''
    This function requests user input in the form of a sentence.
    It tokenises the sentence into words, assigns Part of Speech (POS) tags to each word,
    and then returns the first noun that follows a verb, paired with the verb most recently seen before it.
    
    Returns:
        tuple (verb, noun) if the sentence is a command
        None if the sentence is not a command or no verb-noun pair is found
    '''

    # Request user input
    sentence = input("Please enter a sentence: ") # The input is expected to be a sentence in English.

    # Tokenise the sentence into words
    tokens = nltk.word_tokenize(sentence) # nltk.word_tokenize() function splits the input sentence into individual words (tokens).
    
    # Perform POS tagging
    pos_tags = nltk.pos_tag(tokens) # nltk.pos_tag() function assigns POS tags to each token. 

    # Initialise variables to hold the command and object
    command = None
    obj = None

    # Loop through each word and its corresponding POS tag
    for word, tag in pos_tags:
        # Check if the tag indicates a verb. If yes, assign the word to 'command'.
        if tag.startswith('VB'): # All Penn Treebank verb tags start with 'VB' (VB, VBD, VBG, VBN, VBP, VBZ).
            command = word
        # Check if the tag indicates a noun. If yes, assign the word to 'obj'.
        elif tag.startswith('NN'): # All Penn Treebank noun tags start with 'NN' (NN, NNS, NNP, NNPS).
            obj = word
            # Check if we have found both a command (verb) and an object (noun).
            if command is not None: # If yes, return the pair as a tuple.
                return (command, obj)
            
    # If no command-object pair was found in the sentence, return None.
    return None

# Run the function and print its output
print(extract_command()) # This line calls the function and prints the verb-noun pair if found, else prints 'None'.



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Please enter a sentence: Grab me a book
('Grab', 'book')


While the `extract_command()` function provides a simplistic way to parse commands from user inputs, it has several limitations rooted in its assumptions and implementation details.

1. **Order of Command-Object:** The function expects the verb (command) to appear before the noun (object) in a sentence. In languages or sentence constructions where this order is reversed, or in cases where additional sentence elements intervene between the verb and noun, the function will not correctly identify the command.

2. **Limited Understanding of Sentence Structure:** The function only uses a rudimentary form of sentence parsing, looking for the first verb and noun without considering the overall sentence structure. This can lead to errors in complex sentences with multiple verbs and nouns.

3. **Ambiguity Resolution:** In the presence of multiple verbs or nouns, the function simply takes the first one that fits its command-object pattern. It does not have the ability to resolve ambiguities or to consider user intent.

4. **Lack of Contextual Understanding:** The function does not take into account the contextual meaning of words. It cannot distinguish a command from a question or a statement, as long as a verb is followed by a noun. For example, the question "Do you like the weather?" contains the verb "like" followed by the noun "weather", so it would likely be returned as `('like', 'weather')` rather than `None`.

5. **Tagging Accuracy:** The function's accuracy heavily depends on the accuracy of NLTK's POS tagging, which is not perfect and may vary based on the language and specific domain of the text.

6. **Lack of Error Handling:** If the user inputs something that is not a sentence (e.g., random symbols or characters), the function may return unexpected results or even raise an error.

7. **Lack of Multi-word Object Handling:** The function only captures single-word objects. If the object of a command consists of multiple words (e.g., "my green bag"), the function returns only a single noun (typically "bag") and drops its modifiers ("my green").

Given these limitations, while this function can serve as a starting point for command extraction, it would need to be significantly enhanced and adapted for more advanced or specific use cases in Natural Language Understanding.
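
One possible direction for addressing limitations 4 and 7 is sketched below. It is a heuristic illustration rather than the submitted solution. To stay runnable without NLTK data, it takes tokens that are already tagged, and the Penn Treebank tags in the usage example are assigned by hand (an assumption; a real tagger may label them differently). The function name `extract_command_from_tags` is hypothetical.

```python
def extract_command_from_tags(tagged):
    """tagged: list of (word, Penn Treebank tag) pairs for one sentence."""
    # Drop punctuation tokens, whose tags do not start with a letter
    tokens = [(w, t) for w, t in tagged if t[0].isalpha()]
    if not tokens:
        return None

    # Imperative: the sentence opens with a base-form verb ("Grab the book")
    if tokens[0][1] == "VB":
        verb_index = 0
    # Polite request: modal + "you" + base-form verb ("Can you open the jar?")
    elif (len(tokens) >= 3 and tokens[0][1] == "MD"
          and tokens[1][0].lower() == "you" and tokens[2][1] == "VB"):
        verb_index = 2
    else:
        return None

    # Collect the first run of adjectives/nouns after the verb as the object
    obj_words = []
    for word, tag in tokens[verb_index + 1:]:
        if tag.startswith("NN") or tag.startswith("JJ"):
            obj_words.append(word)
        elif obj_words:
            break
    if not obj_words:
        return None
    return (tokens[verb_index][0], " ".join(obj_words))


# Hand-tagged examples (tags are assumptions, not tagger output)
print(extract_command_from_tags(
    [("Grab", "VB"), ("my", "PRP$"), ("green", "JJ"), ("bag", "NN")]))
print(extract_command_from_tags(
    [("Do", "VBP"), ("you", "PRP"), ("like", "VB"), ("the", "DT"),
     ("weather", "NN"), ("?", ".")]))
```

The first call returns `('Grab', 'green bag')` and the second returns `None`. The sketch still inherits limitation 5: if the tagger mislabels a sentence-initial imperative verb as a noun, the command is missed.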

## Problem 3 Word embedding as features for classification

### Task
Implement a sentiment classifier based on Twitter data to analyse the sentiments of COVID-19 tweets.  

Train and test multiple classification models using the necessary libraries, with the features being sentence embeddings of tweets. 

Report the accuracy and F1 score (micro- and macro-averaged) for each classifier and discuss the differences. 

### Dataset
The dataset has been provided in the Dataset.zip file with the assignment. You are required to use the original tweet text for this classification task. 

### Tweet representation
After necessary pre-processing of the tweets, convert the words into their embeddings, then take the mean of all the word vectors in a tweet to end up with a single vector representing each tweet. The tweet vector is then used for sentiment classification.

In the process of finding the embeddings for each word, you can ignore out-of-vocabulary words.
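
The averaging step described above can be illustrated with a minimal, dependency-free sketch. The three-dimensional vectors, the `toy_embeddings` dictionary and the `tweet_vector` helper are all hypothetical. Plain Python lists stand in for real embedding arrays, and returning a zero vector for a tweet with no known words is an assumed convention rather than a requirement.

```python
# Hypothetical 3-dimensional embeddings; real GloVe vectors have 50-300 dimensions
toy_embeddings = {
    "masks": [0.25, 0.5, 0.0],
    "help": [0.75, 0.0, 0.5],
}
VECTOR_SIZE = 3

def tweet_vector(tokens, embeddings, size=VECTOR_SIZE):
    # Keep only in-vocabulary tokens; out-of-vocabulary words are ignored
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known words: fall back to a zero vector (an assumed convention)
        return [0.0] * size
    # Element-wise mean across the word vectors
    return [sum(column) / len(vectors) for column in zip(*vectors)]

print(tweet_vector(["masks", "help", "zzzz"], toy_embeddings))  # [0.5, 0.25, 0.25]
```

The unknown token `"zzzz"` does not affect the result, because the mean is taken only over the words that have an embedding.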

### Embedding choice
For embedding, you can use GloVe embeddings using Gensim. Sample code is given below, and more information can be found in the lab resources on embeddings. 

However, this is a suggested option. You can use any word embedding of your choice, for example, word2vec, TF-IDF, etc., from any library of your choice.   

### Classifier choice
You are required to implement the following classifiers: 
* One traditional classification model (not a neural network based model)
* One classifier based on any neural network based model. 

You can use PyTorch/TensorFlow/scikit-learn to implement your classifier. However, you are free to develop a classifier from scratch. 

### Your answer must include the following: 
1. Code for data loading, data pre-processing, training, and testing of the models.  
2. A discussion on the comparison between the classifiers based on classifier accuracy and F1 score.

### Suggestion (Optional)
Consider saving a cleaned up version of the dataset after creating the embeddings to a file which can be loaded and used for further experimentation. 

In [14]:
# Imports all of the libraries I want to use to solve the problem space
import nltk
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from sklearn import preprocessing
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.scripts.glove2word2vec import glove2word2vec

# Download the required NLTK resources; this notebook was developed in Google Colab, where a fresh runtime may not include them
nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# Load the training data and the test data, ensuring that the encoding for special characters like those present in these datasets doesn't throw errors
train_df = pd.read_csv("Corona_NLP_train.csv", encoding='iso-8859-1')
test_df = pd.read_csv("Corona_NLP_test.csv", encoding='iso-8859-1')

# Preprocess text
def preprocess_text(text):
    # Lowercase and tokenise
    tokens = word_tokenize(text.lower())
    # Keep only alphabetic tokens that are not stopwords
    return [word for word in tokens if word not in stop_words and word.isalpha()]

train_df['OriginalTweet'] = train_df['OriginalTweet'].apply(preprocess_text)
test_df['OriginalTweet'] = test_df['OriginalTweet'].apply(preprocess_text)

# Load the GloVe model and convert it to word2vec format. The 6B GloVe file was downloaded manually beforehand and must be present in the working directory
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.word2vec.txt'

glove2word2vec(glove_input_file, word2vec_output_file)

# Then I can load the converted file:
glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)

# Creates a vector for each tweet in the dataset 
def get_vector(words):
    word_embeddings = [glove_model[word] for word in words if word in glove_model]
    if len(word_embeddings) == 0:
        return np.zeros(glove_model.vector_size)
    else:
        return np.mean(word_embeddings, axis=0)

train_df['vector'] = train_df['OriginalTweet'].apply(get_vector)
test_df['vector'] = test_df['OriginalTweet'].apply(get_vector)

# Encode the labels
label_encoder = preprocessing.LabelEncoder()
train_df['Sentiment'] = label_encoder.fit_transform(train_df['Sentiment'])
test_df['Sentiment'] = label_encoder.transform(test_df['Sentiment'])

# Here below I will train the classifiers. There is an assignment requirement to use two classifiers, one that is not neural network based.

# First Classifier = Traditional classifier: Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(list(train_df['vector']), train_df['Sentiment'])

# Second Classifier =  Neural network-based classifier: Multi-layer Perceptron
mlp = MLPClassifier()
mlp.fit(list(train_df['vector']), train_df['Sentiment'])

# Test the classifiers and report the accuracy and F1 scores
for classifier in [lr, mlp]:
    y_pred = classifier.predict(list(test_df['vector']))
    accuracy = accuracy_score(test_df['Sentiment'], y_pred)
    f1_micro = f1_score(test_df['Sentiment'], y_pred, average='micro')
    f1_macro = f1_score(test_df['Sentiment'], y_pred, average='macro')

    # This prints the classifiers accuracy and F1 scores for easy evaluation by the user
    print(f'Classifier: {classifier.__class__.__name__}\nAccuracy: {accuracy}\nF1 Score (Micro): {f1_micro}\nF1 Score (Macro): {f1_macro}\n')




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Classifier: LogisticRegression
Accuracy: 0.4426013691416535
F1 Score (Micro): 0.4426013691416535
F1 Score (Macro): 0.4476202279106887

Classifier: MLPClassifier
Accuracy: 0.4439178515007899
F1 Score (Micro): 0.4439178515007899
F1 Score (Macro): 0.45809377681851765





Analysing the results from the two classifiers, **Logistic Regression and MLPClassifier**, it's apparent that both models **exhibit somewhat similar performance levels**, albeit with some nuanced differences.

**Accuracy Score**: Accuracy is the proportion of correct predictions among all predictions made. The accuracy of the Logistic Regression model is approximately 0.4426, while that of the MLPClassifier is slightly higher at approximately 0.4439. The difference of about 0.0013 is very small, and because the MLPClassifier was trained without a fixed random seed, it may fall within run-to-run variation. At most, it suggests that the MLPClassifier is marginally better at making correct predictions on this specific dataset.

**F1 Score (Micro)**: The Micro F1 Score computes the F1 Score from the total true positives, false negatives, and false positives pooled across all classes. Here, both models have a Micro F1 Score identical to their accuracy. This is always the case in single-label multi-class classification: every misclassified tweet counts as exactly one false positive (for the predicted class) and one false negative (for the true class), so micro-averaged precision, recall and F1 all equal accuracy. The Micro F1 Score therefore adds no information beyond accuracy for this task: about 0.4439 for the MLPClassifier versus 0.4426 for the Logistic Regression model.

**F1 Score (Macro)**: The Macro F1 Score calculates the F1 Score for each class independently and then takes the unweighted average, which treats all classes equally. It is often more informative than accuracy on imbalanced datasets, because every class contributes equally regardless of its size. For the Macro F1 Score, the MLPClassifier outperforms the Logistic Regression model by a wider margin: 0.4581 compared with 0.4476. This suggests that, on average, the MLPClassifier might be more capable of handling all classes in the dataset, particularly if there is a class imbalance.
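
The relationship between accuracy, micro F1 and macro F1 can be checked on a small hand-made example. The sketch below is a plain-Python illustration with invented labels, not the scikit-learn implementation used in the notebook, and the helper name `f1_scores` is hypothetical.

```python
def f1_scores(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[t] += 1  # the true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, micro, macro

# Hypothetical, imbalanced labels: "pos" dominates and "neu" is never predicted
y_true = ["pos", "pos", "pos", "pos", "neg", "neu"]
y_pred = ["pos", "pos", "pos", "neg", "neg", "pos"]
accuracy, micro, macro = f1_scores(y_true, y_pred)
print(accuracy, micro, macro)
```

In this example accuracy and micro F1 coincide, while macro F1 is lower because the class that is never predicted contributes an F1 of zero to the average.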

**Overall, based on the provided metrics, it would appear that the MLPClassifier has a slight edge over the Logistic Regression model for this specific dataset**. However, these are raw performance metrics and may not fully reflect the practical suitability of the model. Factors such as training time, interpretability, and computational resources could play a crucial role in deciding the most appropriate model for a given task. Furthermore, it's important to remember that the choice of evaluation metrics should be aligned with the specific objectives and context of the task. For instance, in some scenarios, precision or recall could be more important than overall accuracy.