# Assignment 2: Milestone I Natural Language Processing
## Task 2&3
#### Student Name: Mrwan Alhandi
#### Student ID: s3969393

Date: 9/10/23

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* pandas
* re
* numpy
* scikit-learn
* json
* GloVe embeddings (glove.6B.100d, downloaded separately)

## Introduction
In this task, we will create different feature representations for job advertisement descriptions. The features to be generated include:

1. Bag-of-Words (BoW) model: Generate the count vector representation for each job advertisement description.
2. GloVe embeddings: Generate both TF-IDF weighted and unweighted vector representations using pre-trained GloVe word embeddings.

To summarize, we will create three different types of feature representations: count vectors, TF-IDF weighted embeddings, and unweighted embeddings.

Later in the notebook, a logistic regression model is trained on each type of feature, and 5-fold cross-validation is used to assess the accuracy of each model.
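
As a rough illustration of the three representations listed above, the sketch below uses a two-document toy corpus and made-up 2-dimensional vectors in place of GloVe. The corpus, `toy_embeddings`, and both helper functions are hypothetical and are not part of the notebook's pipeline. The weighted variant sums TF-IDF weight times embedding once per unique term, which is one common formulation and may differ in detail from the implementation used later.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus and made-up 2-d "embeddings" (hypothetical values, not GloVe)
docs = ["nurse care nurse", "engineer software"]
toy_embeddings = {
    "nurse": np.array([1.0, 0.0]),
    "care": np.array([0.0, 1.0]),
    "engineer": np.array([1.0, 1.0]),
    "software": np.array([2.0, 0.0]),
}

# 1. Count vectors: one column per vocabulary term, values are raw counts
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(docs)

# 2. Unweighted embedding: plain mean of the token vectors
def unweighted_doc_vector(doc):
    vectors = [toy_embeddings[t] for t in doc.split() if t in toy_embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(2)

# 3. TF-IDF weighted embedding: sum of (tfidf weight * token vector)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(docs)

def weighted_doc_vector(row, doc):
    vec = np.zeros(2)
    for term in set(doc.split()):
        col = tfidf_vectorizer.vocabulary_.get(term)
        if col is not None and term in toy_embeddings:
            vec += tfidf_matrix[row, col] * toy_embeddings[term]
    return vec

print(count_matrix.toarray())
print(unweighted_doc_vector(docs[0]))
print(weighted_doc_vector(1, docs[1]))
```

The count matrix grows with the vocabulary, while both embedding variants always produce a vector of the embedding dimension, whatever the document length.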

## Importing libraries 

In [1]:
# Required Imports
import pandas as pd
import re
import json
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, cross_val_score

## Task 2. Generating Feature Representations for Job Advertisement Descriptions

In [2]:
def load_vocabulary(vocab_file):
    """This function loads the vocabulary from the disk"""
    vocabulary = {}
    with open(vocab_file, 'r', encoding='utf-8') as file:
        for line in file:
            word, index = line.strip().split(':')
            vocabulary[word] = int(index)
    return vocabulary

def create_count_vectorizer(vocabulary):
    """This function creates a CountVectorizer based on the loaded vocabulary"""
    count_vectorizer = CountVectorizer(vocabulary=vocabulary)
    return count_vectorizer

def generate_count_vectors(df, count_vectorizer, column):
    """This function generates the Count Vector representations for the given column"""
    count_vectors = count_vectorizer.transform(df[column])
    return count_vectors


# Load all the three saved vocabularies
vocabulary = load_vocabulary('vocab.txt')
vocabulary_title = load_vocabulary('vocab_title.txt')
vocabulary_concat_feature = load_vocabulary('vocab_concat_feature.txt')


# Create CountVectorizers based on the loaded vocabularies
count_vectorizer = create_count_vectorizer(vocabulary)
count_vectorizer_title = create_count_vectorizer(vocabulary_title)
count_vectorizer_concat_feature = create_count_vectorizer(vocabulary_concat_feature)


# Read the data and convert the JSON-encoded token columns back into lists
preprocessed_df = pd.read_excel("./combined_data.xlsx")
preprocessed_df["tokens"] = preprocessed_df["tokens"].apply(json.loads)
preprocessed_df["title_tokens"] = preprocessed_df["title_tokens"].apply(json.loads)
preprocessed_df["concat_feature_tokens"] = preprocessed_df["concat_feature_tokens"].apply(json.loads)


# Generate the Count Vector representations (they are saved to disk in a later cell)
count_vectors = generate_count_vectors(preprocessed_df, count_vectorizer, 'description')
count_vectors_title = generate_count_vectors(preprocessed_df, count_vectorizer_title, 'title')
count_vectors_concat_feature = generate_count_vectors(preprocessed_df, count_vectorizer_concat_feature, 'concat_feature')

In [3]:
def load_glove_embeddings(path):
    """This function loads GloVe word vectors into a dictionary"""
    embeddings_index = {}
    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = vector
    return embeddings_index

# Unweighted GloVe Embeddings
def generate_unweighted_glove_embeddings(vocabulary, embedding_dim, embeddings_index):
    """Function for generating unweighted glove embeddings"""
    embedding_matrix = np.zeros((len(vocabulary), embedding_dim))
    
    for word, index in vocabulary.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[index] = embedding_vector
            
    return embedding_matrix

# TF-IDF Weighted GloVe Embeddings
def generate_weighted_glove_embeddings(text_data, vocabulary, embedding_dim, embeddings_index):
    """Function for generating weighted glove embeddings"""
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_vectorizer.fit(text_data)
    
    tfidf_vectors = tfidf_vectorizer.transform(text_data)
    
    weighted_glove_vectors = []
    for i in range(len(text_data)):
        words = text_data[i].split()
        weighted_vector = np.zeros(embedding_dim)
        for word in words:
            if word in vocabulary and word in tfidf_vectorizer.vocabulary_:
                tfidf_weight = tfidf_vectors[i, tfidf_vectorizer.vocabulary_[word]]
                glove_embedding = embeddings_index.get(word)
                if glove_embedding is not None:
                    weighted_vector += tfidf_weight * glove_embedding
        weighted_glove_vectors.append(weighted_vector)
    
    return weighted_glove_vectors


# Load GloVe embeddings
glove_path = "./word_embeddings/glove.6B/glove.6B.100d.txt"
embeddings_index = load_glove_embeddings(glove_path)
    
# Load the data in a variable
text_data = preprocessed_df["description"]
text_data_title = preprocessed_df["title"]
text_data_concat_feature = preprocessed_df["concat_feature"]

# Define embedding dimension
embedding_dim = 100  # Can be adjusted according to GloVe embeddings

# Generate unweighted GloVe embeddings for all three models
unweighted_glove_embeddings = generate_unweighted_glove_embeddings(vocabulary, embedding_dim, embeddings_index)
unweighted_glove_embeddings_title = generate_unweighted_glove_embeddings(vocabulary_title, embedding_dim, embeddings_index)
unweighted_glove_embeddings_concat_feature = generate_unweighted_glove_embeddings(vocabulary_concat_feature, embedding_dim, embeddings_index)


# Generate TF-IDF weighted GloVe embeddings for all three models
weighted_glove_embeddings = generate_weighted_glove_embeddings(text_data, vocabulary, embedding_dim, embeddings_index)
weighted_glove_embeddings_title = generate_weighted_glove_embeddings(text_data_title, vocabulary_title, embedding_dim, embeddings_index)
weighted_glove_embeddings_concat_feature = generate_weighted_glove_embeddings(text_data_concat_feature, vocabulary_concat_feature, embedding_dim, embeddings_index)

### Saving outputs

In [4]:
def save_count_vector_representations(filename, vectors):
    """This function saves the Count Vector representations according to the format given"""
    with open(filename, 'w', encoding='utf-8') as file:
        for webindex, count_vector in zip(preprocessed_df['webindex'], vectors):
            non_zero_indices = count_vector.nonzero()[1]
            counts = count_vector.data
            representation = [f"{index}:{count}" for index, count in zip(non_zero_indices, counts)]
            file.write(f"#{webindex},{' '.join(representation)}\n")

save_count_vector_representations('count_vectors.txt', count_vectors)
save_count_vector_representations('count_vectors_title.txt', count_vectors_title)
save_count_vector_representations('count_vectors_concat_feature.txt', count_vectors_concat_feature)

In [5]:
# For the Main Model
# Save the unweighted GloVe embeddings to a file
np.save("unweighted_glove_embeddings.npy", unweighted_glove_embeddings)

# Save the weighted GloVe embeddings to a file
np.save("weighted_glove_embeddings.npy", weighted_glove_embeddings)


# For title only
# Save the unweighted GloVe embeddings to a file
np.save("unweighted_glove_embeddings_title.npy", unweighted_glove_embeddings_title)

# Save the weighted GloVe embeddings to a file
np.save("weighted_glove_embeddings_title.npy", weighted_glove_embeddings_title)


# For concatenated feature
# Save the unweighted GloVe embeddings to a file
np.save("unweighted_glove_embeddings_concat_feature.npy", unweighted_glove_embeddings_concat_feature)

# Save the weighted GloVe embeddings to a file
np.save("weighted_glove_embeddings_concat_feature.npy", weighted_glove_embeddings_concat_feature)



# This code can be used to load the embeddings from the disk - Uncomment in case required
# # Load unweighted GloVe embeddings
# unweighted_glove_embeddings = np.load("unweighted_glove_embeddings.npy")

# # Load weighted GloVe embeddings
# weighted_glove_embeddings = np.load("weighted_glove_embeddings.npy")
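
The count vector files written by `save_count_vector_representations` use one line per advertisement in the form `#webindex,index:count index:count ...`. A minimal sketch of reading such a line back is shown below. The helper name `parse_count_vector_line` and the example line are hypothetical, and the sketch assumes the webindex itself contains no comma.

```python
def parse_count_vector_line(line):
    """Parse '#webindex,idx:count idx:count ...' into (webindex, {idx: count})."""
    # Split only on the first comma: the left part is '#webindex'
    header, _, body = line.strip().partition(',')
    webindex = header.lstrip('#')

    counts = {}
    for pair in body.split():
        index, count = pair.split(':')
        counts[int(index)] = int(count)
    return webindex, counts


# Hypothetical example line in the saved format
example_line = "#12345,0:2 7:1 42:3\n"
webindex, counts = parse_count_vector_line(example_line)
print(webindex, counts)
```

Storing only the non-zero entries keeps the files small, since most vocabulary terms do not occur in any given advertisement.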

## Task 3. Job Advertisement Classification

#### Model Training - Description [Main Model]

In [6]:
# With weighted glove embeddings

# Train Test Split - 80:20
X_train, X_test, y_train, y_test = train_test_split(weighted_glove_embeddings, preprocessed_df["category"], test_size=0.2, random_state=42)

# Create a Logistic Regression classifier and fit it on the data
clf = LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=1000)
clf.fit(X_train, y_train)

# For prediction
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Test Accuracy (weighted):", accuracy)

Test Accuracy (weighted): 0.7435897435897436


In [7]:
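# For unweighted GloVe embeddings
def get_aggregated_embeddings(column):
    """Average the GloVe vectors of the tokens in each row of `column`.

    Note: tokens are looked up in the description vocabulary and its
    embedding matrix regardless of which column is passed in.
    """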
    aggregated_embeddings = []

    for tokens in (preprocessed_df[column]):
        description_embedding = []
        for token in tokens:
            if token in vocabulary:
                index = vocabulary[token]
                embedding_vector = unweighted_glove_embeddings[index]
                description_embedding.append(embedding_vector)

        # Calculate the mean (average) embedding for the description
        if description_embedding:
            description_embedding = np.mean(description_embedding, axis=0)

        else:
            # If the description is empty or all tokens are out-of-vocabulary, use a zero vector
            description_embedding = np.zeros(embedding_dim)

        aggregated_embeddings.append(description_embedding)

    # Convert the list of aggregated embeddings to a NumPy array
    aggregated_embeddings = np.array(aggregated_embeddings)
    
    return aggregated_embeddings


# With unweighted glove embeddings
# Generate Aggregated Embeddings
aggregated_embeddings = get_aggregated_embeddings("tokens")


# Train Test Split - 80:20
X_train, X_test, y_train, y_test = train_test_split(aggregated_embeddings, preprocessed_df["category"], test_size=0.2, random_state=42)

# Create a Logistic Regression classifier and fit it on the data
clf = LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=1000)
clf.fit(X_train, y_train)

# For prediction
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Test Accuracy (unweighted):", accuracy)

Test Accuracy (unweighted): 0.8076923076923077


In [8]:
# With Count Vectors

# Train Test Split - 80:20
X_train, X_test, y_train, y_test = train_test_split(count_vectors, preprocessed_df["category"], test_size=0.2, random_state=42)

# Create and train the logistic regression model
clf = LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=1000)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8846153846153846


### 5-Fold Cross Validation - Main Model ###

In [9]:
# Number of folds for cross-validation
n_folds = 5

# Initialize the classifier (e.g., Logistic Regression)
clf = LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=1000)

# Define the feature matrices and target labels
X_unweighted_glove = aggregated_embeddings
X_weighted_glove = weighted_glove_embeddings
X_count_vectors = count_vectors 
y = preprocessed_df["category"] 

# Initialize cross-validation splitter
cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)

# Perform cross-validation for each feature representation
for X, representation_name in [(X_unweighted_glove, "Unweighted GloVe"),
                               (X_weighted_glove, "Weighted GloVe (TF-IDF)"),
                               (X_count_vectors, "Count Vectors")]:
    print(f"Cross-validation results for {representation_name}:")
    
    # Perform cross-validation and calculate accuracy scores
    scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
    
    # Print accuracy scores for each fold
    for fold, score in enumerate(scores, start=1):
        print(f"Fold {fold}: {score:.4f}")
    
    # Calculate and print the mean accuracy and standard deviation
    mean_accuracy = scores.mean()
    std_accuracy = scores.std()
    print(f"Mean Accuracy: {mean_accuracy:.4f} (±{std_accuracy:.4f})")
    print("\n")

Cross-validation results for Unweighted GloVe:
Fold 1: 0.8333
Fold 2: 0.8323
Fold 3: 0.8903
Fold 4: 0.7935
Fold 5: 0.8839
Mean Accuracy: 0.8467 (±0.0360)


Cross-validation results for Weighted GloVe (TF-IDF):
Fold 1: 0.7628
Fold 2: 0.7613
Fold 3: 0.8129
Fold 4: 0.7161
Fold 5: 0.8000
Mean Accuracy: 0.7706 (±0.0340)


Cross-validation results for Count Vectors:
Fold 1: 0.8782
Fold 2: 0.8258
Fold 3: 0.8968
Fold 4: 0.8516
Fold 5: 0.9097
Mean Accuracy: 0.8724 (±0.0304)




Based on the 5-fold cross-validation above, count vectors performed best with a mean accuracy of 87.24%, followed by unweighted GloVe at 84.67% and TF-IDF weighted GloVe at 77.06%. This addresses the first question, which asks about the model comparison.
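
Because every representation is evaluated with the same `StratifiedKFold` splitter and the same labels, the folds line up and the scores can also be compared fold by fold. The sketch below does this for count vectors against unweighted GloVe, using the fold accuracies printed above. The variable names are illustrative only, and with just five folds the result is indicative rather than a formal significance test.

```python
import numpy as np

# Per-fold accuracies copied from the cross-validation output above
count_scores = np.array([0.8782, 0.8258, 0.8968, 0.8516, 0.9097])
unweighted_scores = np.array([0.8333, 0.8323, 0.8903, 0.7935, 0.8839])

# Paired difference on each fold (positive means count vectors did better)
diff = count_scores - unweighted_scores
folds_won = int((diff > 0).sum())

print("Per-fold difference:", np.round(diff, 4))
print(f"Count vectors ahead on {folds_won} of {len(diff)} folds")
print(f"Mean difference: {diff.mean():.4f}")
```

The mean gap is smaller than the fold-to-fold standard deviation of either model, so the ranking of these two representations should be read with some caution.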

### 5-Fold Cross Validation - Title Only Model ###

In [10]:
# Modelling using title of the job advertisement
# Number of folds for cross-validation
n_folds = 5

# Initialize the classifier (e.g., Logistic Regression)
clf = LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=1000)

# Define the feature matrices and target labels
X_unweighted_glove = get_aggregated_embeddings("title_tokens")
X_weighted_glove = weighted_glove_embeddings_title
X_count_vectors = count_vectors_title
y = preprocessed_df["category"] 

# Initialize cross-validation splitter
cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)

# Perform cross-validation for each feature representation
for X, representation_name in [(X_unweighted_glove, "Unweighted GloVe"),
                               (X_weighted_glove, "Weighted GloVe (TF-IDF)"),
                               (X_count_vectors, "Count Vectors")]:
    print(f"Cross-validation results for {representation_name}:")
    
    # Perform cross-validation and calculate accuracy scores
    scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
    
    # Print accuracy scores for each fold
    for fold, score in enumerate(scores, start=1):
        print(f"Fold {fold}: {score:.4f}")
    
    # Calculate and print the mean accuracy and standard deviation
    mean_accuracy = scores.mean()
    std_accuracy = scores.std()
    print(f"Mean Accuracy: {mean_accuracy:.4f} (±{std_accuracy:.4f})")
    print("\n")

Cross-validation results for Unweighted GloVe:
Fold 1: 0.5128
Fold 2: 0.4839
Fold 3: 0.6000
Fold 4: 0.5548
Fold 5: 0.5742
Mean Accuracy: 0.5451 (±0.0418)


Cross-validation results for Weighted GloVe (TF-IDF):
Fold 1: 0.2885
Fold 2: 0.3032
Fold 3: 0.3097
Fold 4: 0.3161
Fold 5: 0.3097
Mean Accuracy: 0.3054 (±0.0094)


Cross-validation results for Count Vectors:
Fold 1: 0.6218
Fold 2: 0.5935
Fold 3: 0.6323
Fold 4: 0.5677
Fold 5: 0.6452
Mean Accuracy: 0.6121 (±0.0279)




### 5-Fold Cross Validation - Combined Data ###

In [11]:
# Modelling Using Concatenated Data
# Number of folds for cross-validation
n_folds = 5

# Initialize the classifier (e.g., Logistic Regression)
clf = LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=1000)

# Define the feature matrices and target labels
X_unweighted_glove = get_aggregated_embeddings("concat_feature_tokens")
X_weighted_glove = weighted_glove_embeddings_concat_feature
X_count_vectors = count_vectors_concat_feature
y = preprocessed_df["category"] 

# Initialize cross-validation splitter
cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)

# Perform cross-validation for each feature representation
for X, representation_name in [(X_unweighted_glove, "Unweighted GloVe"),
                               (X_weighted_glove, "Weighted GloVe (TF-IDF)"),
                               (X_count_vectors, "Count Vectors")]:
    print(f"Cross-validation results for {representation_name}:")
    
    # Perform cross-validation and calculate accuracy scores
    scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
    
    # Print accuracy scores for each fold
    for fold, score in enumerate(scores, start=1):
        print(f"Fold {fold}: {score:.4f}")
    
    # Calculate and print the mean accuracy and standard deviation
    mean_accuracy = scores.mean()
    std_accuracy = scores.std()
    print(f"Mean Accuracy: {mean_accuracy:.4f} (±{std_accuracy:.4f})")
    print("\n")

Cross-validation results for Unweighted GloVe:
Fold 1: 0.8269
Fold 2: 0.8516
Fold 3: 0.8968
Fold 4: 0.8194
Fold 5: 0.8903
Mean Accuracy: 0.8570 (±0.0318)


Cross-validation results for Weighted GloVe (TF-IDF):
Fold 1: 0.7564
Fold 2: 0.7677
Fold 3: 0.8000
Fold 4: 0.7161
Fold 5: 0.7935
Mean Accuracy: 0.7668 (±0.0300)


Cross-validation results for Count Vectors:
Fold 1: 0.8846
Fold 2: 0.8194
Fold 3: 0.9097
Fold 4: 0.8452
Fold 5: 0.9161
Mean Accuracy: 0.8750 (±0.0373)




Based on the model training, using only the title causes accuracy to drop significantly (for count vectors, from 87.24% to 61.21%).
If the title is used along with the description, accuracy remains roughly the same (87.50% versus 87.24% for count vectors), and the changes for all three representations are smaller than the fold-to-fold standard deviation. This indicates that the title adds little predictive power beyond what the description already provides. As per question 2, even though we added more data to the description, accuracy did not improve in our case. Extra data helps only when it carries new, good-quality information; otherwise it does not help the model improve its predictions and can even hurt performance.
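
The comparison across the three input settings can be tabulated from the mean accuracies reported in the cross-validation outputs above. This is only a convenience sketch: the DataFrame layout and the column labels are assumptions made for illustration.

```python
import pandas as pd

# Mean 5-fold accuracies copied from the outputs above
mean_accuracy = pd.DataFrame(
    {
        "description": [0.8467, 0.7706, 0.8724],
        "title": [0.5451, 0.3054, 0.6121],
        "title + description": [0.8570, 0.7668, 0.8750],
    },
    index=["Unweighted GloVe", "Weighted GloVe (TF-IDF)", "Count Vectors"],
)

# Change in mean accuracy when the title is added to the description
change = (mean_accuracy["title + description"] - mean_accuracy["description"]).round(4)

print(mean_accuracy)
print(change)
```

The changes are about one percentage point or less in either direction, which is consistent with the observation that adding the title neither clearly helps nor clearly hurts.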

## Summary
This notebook explored three feature representations (count vectors, unweighted GloVe embeddings, and TF-IDF weighted GloVe embeddings) and reported the results of training a logistic regression model on each of them. 5-fold cross-validation was used to assess each model. Overall, count vectors achieved the highest mean accuracy in all three input settings (description, title, and title plus description), making them the strongest representation for the given data.